Software for Constructing and Verifying Pedigrees within Large Genealogies and an Application to the Old Order Amish of Lancaster County Richa Agarwala,1,2 Leslie G. Biesecker,1 Katherine A. Hopkins,3 Clair A. Francomano,4 and Alejandro A. Schaffer1,5 1Laboratory for Genetic Disease Research (LGDR)/National Human Genome Research Institute (NHGRI)/National Institutes of Health (NIH), Bethesda, Maryland 20892 USA; 2(under contract to R.O.W. Sciences, Inc., Rockville, Maryland); 3Immunogenetics Laboratory, Division of Medical Genetics, Department of Medicine, Johns Hopkins University, School of Medicine, Baltimore, Maryland 21205 USA; 4Medical Genetics Branch (MGB)/NHGRI/NIH, Bethesda, Maryland 20892 USA, and Center for Medical Genetics and Department of Medicine, Johns Hopkins University, School of Medicine, Baltimore, Maryland 21205 USA; This paper describes PedHunter, a software package that facilitates creation and verification of pedigrees within large genealogies. A frequent problem in medical genetics is to connect distant relatives with a pedigree. PedHunter uses methods from graph theory to solve two versions of the pedigree connection problem for genealogies as well as other pedigree analysis problems. The pedigrees are produced by PedHunter as files in LINKAGE format ready for linkage analysis. PedHunter uses a relational database of genealogy data, with tables in specified format, for all calculations. The functionality and utility of PedHunter are illustrated by examples using the Amish Genealogy Database (AGDB), which was created for the Old Order Amish community of Lancaster County, Pennsylvania. Many genetic studies require linkage analysis, a key step toward positional cloning of disease suscepti- bility genes. Linkage analysis requires describing pedigrees for a set of people exhibiting a specific trait and verifying relationships in pedigrees (Poly- meropoulos et al. 1996; Khoury et al. 1987). If a pedigree is implicitly part of a large genealogy, it is possible to use computer science techniques to find it. There may be many potential (sub)pedigrees that connect the individuals with the trait, and it is de- sirable to find a (sub)pedigree that is optimal in a rigorous mathematical sense. This paper describes software tools, called PedHunter, to keep genealogi- cal data in a relational database and to analyze the data systematically. PedHunter solves two distinct formulations of the optimal pedigree connection problem, as well as other problems related to pedi- gree construction, verification, and analysis. Closed populations are a rich source of material for medical genetics studies. For example, closed populations provide powerful data sets to map re- cessive disease genes by homozygosity mapping (Kruglyak et al. 1995). Some closed populations have compiled genealogies because of their inter- ests, and these genealogies enhance the utility of the populations for genetic studies. The Old Order Amish of Lancaster County in Pennsylvania is one such population. Most ancestors of the Amish in the United States migrated in the early eighteenth cen- tury from Germany and Switzerland. They settled in Berks County, Pennsylvania, and gradually radiated to other parts of Pennsylvania, Ohio, Indiana, Illi- nois, and Iowa. Some settled in Lancaster County, in about 1795. Information about Amish pedigrees of Lancaster County is available in a book called the Fisher Family History (Fisher Family History 1988), also known as the ‘‘Fisher book,’’ which traces the lineage of the Amish in magnificent detail. In spite of the rich information in this book, it is not easy to use in medical genetics for a number of reasons. (1) The book (1988 version) has ∼550 pages of informa- 5Corresponding author. E-MAIL schaffer@helix.nih.gov; FAX (410) 550-7513. RESEARCH 8:211–221 ©1998 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/98 $5.00; www.genome.org GENOME RESEARCH 211 Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com tion in small font. (2) The organizers grouped infor- mation about extended families in the book. This ordering is useful for Amish readers interested in their proximal genealogy but is problematic for the construction of extended pedigrees needed for ge- netic analysis. (3) Many people have the same first and last name. (4) There are no pedigree diagrams; only text. (5) Some information about a person may appear in several places because that person has his/ her own record and also appears in the records of parents and children. Data are sometimes inconsis- tent across multiple entries. Because of these limitations, tracing family re- lationships for genetic studies requires laborious searching. This paper describes a database called Amish Genealogy Database (AGDB) for the geneal- ogy of the Old Order Amish of Lancaster County, Pennsylvania. AGDB contains information from the 1988 version of Fisher Family History that has been corrected for errors and contains additional infor- mation compiled subsequent to 1988. Keeping the data in AGDB and querying the data with PedHunter facilitates usage of the genealogy by medical geneti- cists. Beyond the Old Order Amish, PedHunter is a general tool that can be used to analyze any gene- alogy. The use of PedHunter requires only a database with information for that population. For reasons of confidentiality no phenotype information is stored in the database. Phenotype information known only by the user may be helpful in posing geneti- cally interesting queries. Tools from Computer Science This section summarizes relevant data storage and data analysis tools from the domain of computer science and suggests how they can be used to make a genealogy database. The data in AGDB are stored in a relational database (Date 1990). PedHunter is de- signed to use relational databases developed in SYBASE. SYBASE provides an implementation of Structured Query Language (SQL) useful for answer- ing simple queries and an interface with the C pro- gramming language useful for implementing more complex queries. Data Organization Each record in the Fisher book contains information about a nuclear family. For example (information in this example of a family record is fictitious, but the syntax is accurate), 894. Noah Karp b Jan. 8, 1938, d May 9, 1988, farmer, m Dec. 24, 1957, Mary Green (4211) b Sep. 12, 1940, d 1962. Children: Victor (895), Sarah (4200), Amy. m2 1964, Annie Green (4211) b Aug. 23, 1942. Children: Henry. 895. Victor Karp (894) b May 24, 1959, m June 6, 1982, Mildred Sweeney. 4211. John Green b Dec. 24, 1920, m Sarah Fisher. Children: Annie (894), Mary (894). The data were obtained as an ASCII text file with some typesetting commands inserted. Using standard methods from the theory of formal lan- guages these family records were parsed and col- lated into a record of the information for each per- son. Each record in the Fisher book contains the fol- lowing pieces of information relevant to genealo- gies: ● Name of family head; e.g., Noah Karp, Victor Karp, and John Green in the above three examples, re- spectively. ● A number called the Fisher number or family number. For the above examples, the Fisher num- bers are 894, 895, and 4211, respectively. ● If the parents of a family head are known, the record contains the Fisher number for that family; e.g., record 895 refers to record 894, Victor’s par- ents. ● List of spouses. For each spouse, the record gives the name and attempts to give a Fisher number in following order: (1) Fisher number for the family in which spouse is the family head; (2) Fisher number for the family in which spouse is a child. For example, in record 894 Noah is married first to Mary and then to Annie Green; the Fisher number 4211 following those names refers to the family in which they were children. ● List of children of each spouse. This list contains names of children and attempts to give a Fisher number for each child in the following order: (1) Fisher number for the family in which the child is the family head, e.g., 895 for Victor Karp in record 894; (2) Fisher number for the family in which the child is a spouse, e.g, 894 for both Annie and Mary in record 4211. Data Inconsistencies The analysis of data with a relational database re- quires (1) resolving or deleting inconsistencies in the data, (2) designing the database, (3) creating the AGARWALA ET AL. 212 GENOME RESEARCH Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com tables, and (4) writing programs to analyze the da- tabase. Of these four steps, the only task that is spe- cific to a population is deleting or resolving incon- sistencies. The other three tasks are greatly facili- tated by using a commercial SQL implementation. Finding inconsistencies is feasible if the raw geneal- ogy data have redundant and conflicting informa- tion. Because of the structure of the records in the Fisher book, the detectable inconsistencies include parent–child relationships and husband–wife rela- tionships. Some inconsistencies are caused by infor- mation that is present in one record but absent in another; it is assumed that the information is cor- rect in this case. Other inconsistencies are appar- ently caused by typographical errors in Fisher num- bers, in which case it may be straightforward to decide which numbers are (in)correct. A dispropor- tionate number of the errors occur in the data en- tered after 1988, presumably because those data have not been published and hence have not been proofread by large numbers of Amish. In addition, the number of recorded births, deaths, and mar- riages per year dropped precipitously after 1988, suggesting that much recent information is missing. Efforts have been initiated with the Amish commu- nity to correct remaining errors and update records. Database Design Each person in the database is assigned a unique integer, called the program id. There are two cen- tral tables in the database—the person table and the relationship table. The person table has information exclusive to the individuals, and the relationship table keeps information relating the individuals by containing an entry for each couple and a list of their children, if any, in terms of their program id. Please see Appendix for details of database de- sign. Analyzing the Database After the database is created, the next step is to use analytic tools to extract useful information. The analytic tools in PedHunter rely heavily on graph theory (Even 1979). In the Appendix, the notion of a graph is introduced along with related terminol- ogy. RESULTS AND EXAMPLES This section describes the queries available in PedHunter and shows the utility of PedHunter with a usage of AGDB on a published pedigree. PedHunter supports functions for finding relatives of individu- als (father, mother, siblings, half-siblings, uncles and aunts, children, first cousins, ancestors) as well as functions for testing if two individuals are related according to a specified relationship. The more complicated queries supported by PedHunter are as follows: ● lca(S): [lowest common ancestors for S] This query finds the most recent common ancestors of the given set of people S; the set of most recent common ancestors of S is the set of people A such that (1) everyone in A is an ancestor of everyone in S, and (2) no one in A has a descendant who is also an ancestor of everyone in S. ● asp(S): [all shortest paths pedigree for S] If there is at least one common ancestor for the given set of people S, then for each lowest common ances- tor a, asp(S) finds the pedigree containing all shortest paths between a and each person in S. PedHunter contains an implementation of Dijk- stra’s Algorithm (Dijkstra 1959) for finding all shortest paths. PedHunter takes a few seconds for creating modest size pedigrees (e.g., connecting 85–90 individuals resulting in a pedigree of ∼500 individuals). This is a significant improvement over the tens of hours typically required to create such pedigrees by manually searching the Fisher book. ● minped(S): [minimal pedigree for S] If there is at least one common ancestor for the given set of people S, then for each lowest common ancestor A, minped(S) finds a pedigree connecting A and each person in S such that (1) the pedigree has the minimum number of vertices, and (2) every path from A to each person in S is the shortest path. The problem in finding a minimal pedigree for S is the Steiner tree problem, with asp(S) being the given graph and S being the set of vertices to be connected. PedHunter uses a ‘‘branch and bound’’ algorithm for finding a minimal pedigree (de- scribed in the Appendix). ● pedigree length(X,N): Prints a pedigree contain- ing all descendants of X who are at most N gen- erations apart from X. For example, pedigree- length(X,2) lists all children and grandchildren of X. ● kinship(X,Y): Computes the kinship coefficient between X and Y where kinship(X,Y) is the prob- ability that a randomly selected allele of X and a randomly selected allele of Y are identical by de- scent. PedHunter uses the algorithm in Weir (1996, pp. 204–206). SOFTWARE FOR CONSTRUCTING PEDIGREES GENOME RESEARCH 213 Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com ● inbreeding(X): Computes the inbreeding coeffi- cient for X where inbreeding(X) is the probability that the two alleles of X are identical by descent. Because X inherits a randomly selected allele from mother and a randomly selected allele from fa- ther, inbreeding(X) = kinship(X’s mother, X’s father). PedHunter uses the algorithm in Weir (1996, pp. 204–206). ● print(X): Print all information about person X. All pedigrees are produced in LINKAGE format (Terwilliger and Ott 1994). The PedHunter output pedigree files can be read into some versions of pedi- gree/genotype maintenance programs such as CYRILLIC 2.0 (Chapman 1990) and Pedigree/Draw 5.0 (Mamelka et al. 1993). Two of the drawings here were produced by feeding PedHunter output into Pedigree/Draw 5.0. The all shortest paths pedigree may contain many connections between pairs of individuals, and it can be quite large. This pedigree may be too com- plicated for linkage analysis with current tools and may be impossible to draw comprehensibly (e.g., the drawing in McKusick et al. 1964). These two problems justify the desire for a minimal pedigree that is a subpedigree of the all shortest paths pedi- gree. A minimal pedigree can be used to explain most likely paths of inheritance of the mutant trait gene. A minimal pedigree can be used as a starting point for the geneticist to reinstate more individuals that are in the all shortest paths pedigree in an it- erative linkage analysis. To verify a given pedigree (P), one must verify all paths in P. A path is considered to be verified if the parent–child relationships represented by the edges on the path are consistent with the informa- tion in the database. The paths in P can be verified as follows: 1. Find the set of people S in P who do not have a child in P and have at least one parent in P. 2. Find asp(S). Paths in P that are present in asp(S) are verified. If all the paths in P are verified, the verification is complete. 3. For each ‘‘long’’ path p from u to v that is not yet verified, find asp(R) where the elements of R are u and v. If p is a path in asp(R), p is verified. If all the paths in P are verified, verification is complete. 4. For each edge u → v in P that is not yet verified, do the query is child(v,u), which returns TRUE if v is a child of u and FALSE otherwise. If it returns FALSE, the edge is in error. [It can be corrected by using query features father, mother, ancestor, chil- dren, as appropriate for the pedigree. For in- stance, if confidence about v is high, u can be corrected by using father(v) and mother(v).] It is not practical to have a query feature for verifying an entire pedigree at once because it is difficult to determine what to do when there are multiple related errors in the pedigree. In many cases, the corrections would depend on knowledge not present in the pedigree database. The utility of AGDB and PedHunter are demon- strated in an example concerning a genetic disorder currently under study. McKusick (1978) published a pedigree that connects the children of six couples. Field work and consultation with the authors of the original study yielded the names of the persons, so they could be accessed in AGDB. Three of the couples are present in the Fisher book and AGDB and will be used as a test example. Part of the pub- lished pedigree that connects the three couples that are in AGDB is shown in graph representation in Figure 1 and in pedigree representation in Figure 2. The pedigree in Figure 2 is designated MKped. Let S be the set with elements 10, 16, 22, 26, 27, 32, rep- resenting the couples that are in AGDB. Using lca(S), it turns out that there are five least common ances- tors for the three couples. Figure 3 shows the all shortest paths pedigree (ASPped) for the least com- mon ancestor that has the maximum number of vertices in common with MKped. A minimal pedi- Figure 1 Graph for the partial McKusick pedigree. AGARWALA ET AL. 214 GENOME RESEARCH Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com gree with the same least common ancestor is shown in Figure 4. MKped (Fig. 2) can also be used to illustrate pedigree verification. The paths in MKped that are verified by ASPped (Fig. 3) are as follows: (1) path from 6 to 10; (2) path from 14 to 16; (3) path from 19 to 22; (4) path from 23 to 26 and 27; and (5) path from 29 to 32. According to ASPped 19 and 23 are siblings, whereas in MKped 18 and 23 are siblings and 19 is a child of 18. This discrepancy is due to an error in MKped—person 18 should not be present in MKped. The following relationships are verified using is child/siblings query features: (1) 5 is the father of 6; (2) 13 is the father of 14; (3) 17 is the mother of 19 and 23; (4) 28 is the mother of 29; and (5) 17 and 28 are siblings. DISCUSSION Much attention has been paid toward database management and analysis [e.g., GDB (Fasman et al. 1997), GenBank (Benson et al. 1997), SWISS-PROT (Bairoch and Apweiler 1997)] of genomes as well as to making extensive family trees [e.g., GEDCOM (1997), GED2HTML (1997)] but there is no tool suf- ficient for a medical geneticist who has a large genealogy as well as genetic information for some of the individuals. Existing software packages [e.g., CYRILLIC (Chapman 1990), DGENE (Lange et al. 1988), Pedigree/Draw (Mamelka at al. 1993), PEDRAW/WPEDRAW (Curtis 1990), PhenoDB (Cheung et al. 1996), SCHESIS (Round 1989)] man- age genetic information, store pedigrees, and draw pedigrees but do not automatically extract or verify pedigrees from a genealogy. PEDSYS (Dyke 1992) supports pedigree extraction but not based on the type of optimality criteria implemented in PedHunter. PedHunter provides tools to organize ge- nealogical information in a database and a query facility to access the database. This functionality can create and verify pedigrees according to criteria specified by the user. Advantages of keeping data in a relational database include the following: ● Simplicity: The concept of a relation provides a self-contained means of representing all types of data relationships using simple tables. ● Flexibility and Restructuring: There is complete flexibility in database design. Relations can be combined or broken down in other relations with ease. ● Accessibility and Protection: Keeping data in a rela- tional database provides variable and ready access Figure 3 All shortest paths pedigree (ASPped). Figure 2 Partial McKusick pedigree (Mkped). SOFTWARE FOR CONSTRUCTING PEDIGREES GENOME RESEARCH 215 Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com to all data. Data can be protected and can be used by multiple users. ● Update: Updating the information in a database is simple because one can verify whether informa- tion is consistent. It is important to keep data current for future use. ● Optimization: A good database design does not du- plicate information, which reduces data storage and redundancy. ● Easy Storage: Physical data storage and logical da- tabase structure are independent. There is no worry about where and how the files are stored and they can be accessed using the relational tables. The usage of PedHunter should be extensible to other existing genealogies besides the Fisher book: (1) There are other genealogy books of the Old Or- der Amish that overlap with the Fisher book; (2) there are genealogy books of other Amish commu- nities from the midwestern United States; (3) there are genealogy books of other closed North American communities, such as Hutterites and Mennonites; (4) there are genealogies loosely compiled for sev- eral island populations and for parts of Finland. Ef- forts have been initiated to obtain some of these genealogies, to obtain permission to computerize them, and to get the raw data into a computer- readable text file. To use PedHunter on a genealogy, one need only create a database with two tables— person table and child parent table, as described earlier and use the utility program provided to cre- ate generation table. PedHunter is freely avail- able to scientists studying populations with written genealogies. Interested scien- tists should send an electronic mail request to richa@helix. nih.gov or to schaffer@helix. nih.gov. AGDB is intended to be available to scientists con- ducting National Cancer In- stitute Internal Review Board (IRB)-approved research on the Old Order Amish. Access to AGDB is carefully con- trolled and limited by IRB protocol. In the initial phase, the IRB approved usage by five groups that have proto- cols approved by their IRBs to study the Amish. Each user group is represented by at least one coinvestigator on the protocol for AGDB. More user groups meeting these criteria can be added. The hunt for disease susceptibility genes has be- come increasingly automated. One step that has been lacking automated tools is that of connecting distant relatives with the same phenotype into a pedigree for linkage analysis. When the desired pedigree is implicitly part of a large, existing gene- alogy, computer science techniques can be adapted to extract the pedigree or to verify a putative pedi- gree systematically. PedHunter provides this automa- tion. The utility of PedHunter is demonstrated by using it to query a database, AGDB, of a large Old Order Amish genealogy. These tools were used to verify and correct parts of a previously published pedigree. The combination of AGDB and PedHunter are immediately useful in ongoing genetic studies of this population (e.g., Stone et al. 1998). Conversations with geneticists who use the Fisher book suggest that they are content when they can find any pedigree to connect a specified set of individuals. There is no reason to believe that the first pedigree found is the best. In contrast, the all shortest paths and Steiner criteria that PedHunter of- fers are mathematically rigorous and genetically sensible. METHODS AGDB was created in SYBASE SQL Server release 11.0.x of the SYBASE database management soft- Figure 4 Minimal (Steiner) pedigree. AGARWALA ET AL. 216 GENOME RESEARCH Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com ware. The functions are implemented using Trans- act-SQL and C version of Open Client DB-Library. PedHunter was developed on Unix System V release 4.0 running under SunOS 5.5.1 using Sun Work- Shop Compiler C 4.2 but it is compatible with other Unix computers. All of the PedHunter functions are available as executable files that can be used from the command line prompt; AGDB is available to sci- entists conducting IRB-approved research on the Old Order Amish. The automatic compilation of the Fisher book genealogy into AGDB was approved by the National Cancer Institute IRB at NIH. ACKNOWLEDGMENTS This research was performed within the Division of Intramu- ral Research (NHGRI). We thank Dr. Robert Nussbaum for frequent and active encouragement, for helpful guidance about the conventions of genetics research, and for sugges- tions on this manuscript. We also thank Katie Beiler and our other Amish colleagues and friends in Lancaster County. We are grateful to Gunther Birznieks and Ed Whitley for assis- tance with SYBASE. We thank Dr. Cyril Chapman, Kevin Crawford, and Dr. Paul Mamelka for their help with CYRILLIC and Pedigree/Draw. We thank Carol Machan for providing the raw data files for AGDB. The research of K.A.H. was sup- ported by a grant (NIH P01HG00373) from the National Cen- ter for Human Genome Research. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked ‘‘advertisement’’ in accordance with 18 USC section 1734 solely to indicate this fact. REFERENCES Bairoch, A. and R. Apweiler. 1997. The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res. 25: 31–36. Benson, D.A., M.S. Boguski, D.J. Lipman, and J. Ostell. 1997. GenBank. Nucleic Acids Res. 25: 1–6. Chapman, C.J. 1990. A visual interface to computer programs for linkage analysis. Am. J. Med. Genet. 36: 155–160. Cheung, K.-H., P. Nadkarni, S. Silverstein, J.R. Kidd, A.J. Pakstis, P. Miller, and K.K. Kidd. 1996. PhenoDB: An integrated client/server database for linkage and population genetics. Comp. Biomed. Res. 29: 327–337. Curtis, D. 1990. A program to draw pedigrees using LINKAGE or LINKSYS data files. Ann. Hum. Genet. 54: 365–367. Date, C.J. 1990. An introduction to database systems (Vol. 1). Addison-Wesley, New York, NY. Dijkstra, E.W. 1959. A note on two problems in connection with graphs. Num. Math. 1: 269–271. Dyke, B. 1992. PEDSYS: A pedigree data management system. Population Genetics Laboartory, Southwest Foundation for Biomedical Research, San Antonio, TX. Even, S. 1979. Graph algorithms. Computer Science Press, Rockville, MD. Fasman, K.H., S.I. Letovsky, R.W. Cottingham, and D.T. Kingsbury. 1997. The GDB(TM) Human Genome Database. Nucleic Acids Res. 25: 72–80. Fisher Family History. 1988. First ed. (1957) compiled by John M. Fisher family, published by Amos L. Fisher. Second ed. (1972) edited by Janice A. Egeland, published by Moore Clinic, Johns Hopkins Hospital, Baltimore, MD. Third ed. (1988) revised and updated by a committee of Fisher descendants (ed. Katie Beiler) Eby’s Quality Printing, PA. GEDCOM. 1997. GEDCOM Coordinator-3T, Family History Department, Salt Lake City, UT 84150. e-mail: gedcom@gedcom.org. GED2HTML: A GEDCOM to HTML Translator. 1997. Eugene W. Stark, Port Jefferson, NY 11777. e-mail: stark@cs.sunysb.edu. URL: http://www.gendex.com/ged2html. Hwang, F.K. and D.S. Richards. 1992. Steiner tree problems. Networks 22: 55–89. Khoury, M.J., B.H. Cohen, E.L. Diamond, G.A. Chase, and V.A. McKusick. 1987. Inbreeding and prereproductive mortality in the Old Order Amish. III. Direct and indirect effects of inbreeding. Am. J. Epidemiol. 125: 473–483. Kruglyak, L., M.J. Daly, and E.S. Lander. 1995. Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. Am. J. Hum. Genet. 56: 519–527. Lange, K. and R.C. Elston. 1975. Extensions to pedigree analysis. I. Likelihood calculations for simple and complex pedigrees. Hum. Hered. 25: 95–105. Lange, K., D. Weeks, and M. Boehnke. 1988. Programs for pedigree analysis: MENDEL, FISHER, and dGENE. Genet. Epidemiol. 5: 471–472. Mamelka, P.M., B. Dyke, and J.W. MacCluer. 1993, 2nd ed. Pedigree/Draw for the Apple Macintosh, Population Genetics Laboratory Technical Report No. 1. Southwest Foundation for Biomed. Research, San Antonio, TX. McKusick, V.A. 1978. The William Allan Memorial Award Lecture: Genetic nosology: Three approaches. Am. J. Hum. Genet. 30: 105–122. McKusick, V.A., J.A. Egeland, R. Etheridge, and D.E. Krusen. 1964. Dwarfism in the Amish. I. The Ellis-van Creveld syndrome. Bull. Johns Hopkins Hosp. 115: 306–336. Papdimitriou, C.H. and K. Steiglitz. 1982. Combinatorial optimization: Algorithms and complexity. Prentice-Hall, Englewood Cliffs, NJ. SOFTWARE FOR CONSTRUCTING PEDIGREES GENOME RESEARCH 217 Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com Polymeropoulos, M.H., S.E. Ide, M. Wright, J. Goodship, J. Weissenbach, R.E. Pyeritz, E.O. Da Silva, R.I. Ortiz De Luna, and C.A. Francomano. 1996. The gene for the Ellis-van Creveld syndrome is located on chromosome 4p16. Genomics 35: 1–5. Round, A.P. 1989. SCHESIS: Computation in a multipoint linkage analysis program. Prog. Clin. Biol. Res. 329: 69–74. Stone, D., R. Agarwala, A.A. Schäffer, J.L. Weber, D. Vaske, T. Oda, S.C. Chandrasekharappa, C.A. Francomano, and L.G. Biesecker. 1998. Genetic and physical mapping of the McKusick-Kaufman Syndrome. Hum. Mol. Genet. 7: 475–481. Terwilliger, J.D. and J. Ott. 1994. Handbook of human genetic linkage. The Johns Hopkins University Press, Baltimore, MD. Weir, B.S. 1996. Genetic data analysis II. Sinauer Associates, Sunderland, MA. Zelikovsky, A. 1997. A series of approximation algorithms for the acyclic directed Steiner tree problem. Algorithmica 18: 99–110. Received September 22, 1997; accepted in revised form February 6, 1998. APPENDIX This Appendix describes the database design, graphs and related terminology, and method used by PedHunter to solve the minimal pedigree, or Steiner pedigree problem. Database Design Each person in AGDB is assigned a unique integer, called the program id. Utility programs were writ- ten to translate from a Fisher id to the program id of the first person in that record, and back from a program id to the Fisher id of one record where that person is listed. There are two central tables in the database—the person table and the relationship table. The person table (person table) has informa- tion exclusive to the individuals. The columns of this table are program id, Name, Birth date, Death date, Address, Gender, Special status, Number of spouses, and Other Information. The Special status field contains information about whether a person is a twin, triplet, foster child, or adopted child. The relationship table (child parent table) keeps in- formation relating the individuals. It contains an entry for each couple and a list of their children, if any, in terms of their program id specified in per- son table. The columns of child parent table are program id of father, program id of mother, Marriage date, and program id of each child. The last field is a list of program ids of each child, if the couple has any. As each row/column combina- tion can have only one value, the list is kept as a string with special delimiters between children. PedHunter contains a utility program for creat- ing generation table, which contains the maxi- mum generation level for each person relative to other persons in the database. The two columns of this table are program id of the individual and Maximum generation level. The generation table is used in computations including inbreeding and kinship coefficients. Graphs and Related Terminology A graph is a mathematical representation of objects and the relationship between pairs of objects. Some standard graph terminology is defined below and illustrated by references to the graph in Figure 1. The set of objects represented in a graph are called its vertices. If two objects are related, they are joined by an edge. For convenience, the following notation is used: ● V(G) represents the set of vertices for graph G. ● E(G) represents the set of edges for graph G. ● e is u–v if edge e joins vertices u and v and the order of u and v is not important. Such an edge is called an undirected edge. ● e is u → v if edge e joins vertices u and v and the edge is ordered from u to v. Such an edge is called a directed edge. ● v is in the set V(G) if v is a vertex in graph G. ● e is in the set E(G) if e is an edge in graph G. A graph is a directed graph if its edges are di- rected edges; otherwise, it is an undirected graph. A pedigree can be represented by a graph that has a vertex for each person and an edge for each child– parent pair among the persons in the pedigree. Sometimes it is desirable to make the graph directed by directing each edge from the vertex for parent to the vertex for child. For example, the pedigree (Fig. 2) that is used later as an example, can be repre- sented as in Figure 1. It is useful to view a pedigree as a graph because it traces the passage of genetic information over generations and provides a conve- nient mathematical representation. A graph repre- senting a pedigree is called a pedigree graph. For the pedigree graph in Figure 1, the set of vertices V(G) is {1,2,. . .,32}, and there are 32 directed edges, includ- ing 6 → 7 and 17 → 23. A path from vertex u to vertex v in G is an alter- AGARWALA ET AL. 218 GENOME RESEARCH Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com nating sequence of vertices and edges of G, begin- ning with u and ending with v, such that no vertex is repeated and every edge joins the vertices imme- diately preceding it and following it. A path p con- necting u to v is denoted p = u, u1, u2, . . ., un, v, where u1, u2, . . ., un is the sequence of other vertices on the path. For example, 1 → 3 → 4 → 5 is a di- rected path in the graph of Figure 1. The undirected sequence 14—13—12—11—2—3—4—5 is a path in the undirected version of this graph, but not in the directed version. The two paths are denoted as p = 1, 3, 4, 5, and p = 14, 13, 12, 11, 2, 3, 4, 5, respectively. A directed path p from u to v in a pedigree graph traces one way that v receives genetic material from u. The vertices on p are also a (partial) list of ances- tors of v. Absence of a directed path from u to v indicates that u is not an ancestor of v. Hence, there is no way that v can receive genetic material from u; for example, in Figure 1, person 28 does not receive genetic material directly from person 6, although they have a common ancestor. A pedigree graph G connects a set of individuals S if there is a vertex in V(G) for each individual in S, and in the undirected version of the graph, there is a path between each pair of vertices representing people in S. The pedigree of Figure 1 is connected. Geneticists usually use the term pedigree to mean a set of parent–child relationships that induce a pedi- gree graph whose undirected version is connected. Therefore, any formal definition of pedigree construc- tion should require that the output pedigree induce a connected pedigree graph. However, this require- ment does not specify the output pedigree uniquely, and different specifications of pedigree construction are useful for different purposes. A cycle is a path that begins and ends at the same vertex. If a graph has no cycles, it is called acyclic. Any directed pedigree graph is acyclic, as the edges are directed from parent to child, and it is biologi- cally impossible to have a cycle in this setting. The term loop is not used here because formally a loop in a pedigree is a cycle in a more complicated, undi- rected representation called a marriage graph (Lange and Elston 1975). For example, a first-cousin mar- riage with offspring leads to a loop in the undirected marriage graph but not to a cycle in the directed pedigree graph. The length of a path is the number of edges in the sequence defining the path; for example the length of path p = u0, u1, u2, . . ., un is n. A path from vertex u to vertex v is a shortest path if it has mini- mum length of any path from u to v. The length of a shortest path from u to v is also called the distance d(u,v) from u to v. A shortest path from u to v makes the fewest assumptions about how an allele could have passed from u to v. By the principle of parsimony, a hy- pothesis that a disease-causing allele passed from u to v requires consideration of all shortest paths from u to v. This observation motivates the asp pedigree construction query. If G and H are graphs, then H is a subgraph of G if and only if V(H) is a subset of V(G) and E(H) is a subset of E(G). Given a pedigree graph G that connects a set of people S, it is sometimes useful to find a subgraph of G that still connects S. One such situation is when the pedigree graph has too many edges to be drawn comprehensibly. A subgraph of G that contains minimum number of vertices and connects S is called a minimal subgraph, and the problem of find- ing a minimal connected subgraph is called a Steiner tree problem in the computer science literature. Therefore, a minimal pedigree subgraph that con- nects specified individuals is called a Steiner pedigree. PedHunter finds Steiner pedigrees with the query called minped. Steiner pedigrees are useful for draw- ing and for linkage analysis when the pedigree graph has too many edges. PedHunter’s method to construct Steiner pedigrees and some background from the computer science literature are explained in the next section. Steiner Pedigree Steiner tree problems are widely studied due to ap- plications such as designing road networks and VLSI chips. A survey of Steiner tree problems can be found in Hwang and Richards (1992). Many Steiner tree problems are quite intractable computationally, but pedigree graphs have special features that may make the problem easier. Pedigree graphs are acy- clic, and the vertices that must be included typically represent children at the bottom generation. Such vertices that have no outgoing edges are called ter- minals. In a recent paper Zelikovsky (1997) studied the Steiner tree problem for all directed acyclic graphs with the constraint that the vertices to be connected are terminals. This problem has applica- tions outside genetics. Zelikovsky showed that this version of the Steiner tree problem is intractable in a formal sense and gave an approximation algorithm for the problem. An approximation algorithm finds a subgraph that contains all of the prescribed vertices whose size is bounded relative to the size of the minimal subgraph, where the size of a graph is the number of vertices in the graph. However, both the performance ratio and the computation time of this SOFTWARE FOR CONSTRUCTING PEDIGREES GENOME RESEARCH 219 Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com approximation algorithm grow exponentially with the length of the longest path in the graph. Instead of Zelikovsky’s approximation algo- rithm, PedHunter uses an algorithm that finds an exact smallest subgraph (subpedigree) as measured by the number of vertices (individuals). The ap- proximation algorithm may be inadequate for the subsequent linkage analysis. The exact algorithm applies a general technique called branch and bound, which is commonly used to explore exponential-size search spaces in intractable search problems (Papdimitriou and Steiglitz 1982). Branch and bound works well in problems where one can quickly get to a good solution. The weight of the known solution can be used to eliminate many partial solutions whose weight is larger (for a minimization problem such as Steiner tree). When a partial solution is eliminated, all full solutions con- taining that partial solution are implicitly elimi- nated also. Branch refers to steps that expand the current partial solution by branching in the search space; bound refers to using a known solution as a bound. Branch and bound algorithms start by finding an initial solution, keep track of the current best solution, and probe other solutions until it is pos- sible to determine that the current solution is at least as good as any other solution. Branch and bound algorithms typically use retracing or back- tracking, which means that part of the current (par- tial solution) is removed, and the algorithms pro- ceed to consider other solutions that do not contain the removed part. To describe a specific branch and bound algorithm one must describe (1) What makes one solution better than another? (2) How to find an initial solution? (3) How to probe all solutions? For the minimal pedigree problem, the algorithm solves the three basic questions as follows. Comparing Solutions The number of edges in the solution is the measure for comparing solutions and the solution with fewer number of edges is better. The number of edges in a solution is called the cost of the solution. The prob- lem statement asks to minimize the number of ver- tices, but any connected graph with no cycles that has e edges, must have exactly e+1 vertices (Even 1979). Therefore, minimizing number of edges is equivalent to minimizing number of vertices. The algorithm transforms the initial Steiner problem where all edges have weight 1 to a related problem where edges may have larger weight. In the weighted version, the cost of a solution is the sum of weights of edges the solution contains. Initial Solution The steps for finding an initial solution are as fol- lows: 1. Start with the specified set of individuals as the vertices in the solution. 2. For each vertex v in the solution, except the root, that is not connected to a parent, choose a parent p that is present in the input graph. If the parent chosen is not in the solution, add a vertex for p in the solution. Add the edge p → v to the solution. 3. Repeat step 2 until there is only one vertex in the solution that is not connected to a parent. That vertex is the root of the pedigree. Probing All Solutions Every edge in the input graph G is either in the cur- rent solution or it is out. The edges are numbered 1, . . .,|E(G)| such that edge u → v is preceded in the order by all edges that are incoming into u. Such an order is possible for pedigree graphs because they are acyclic (Even 1979). Consider the edges in order e = 1, 2, 3, . . . . Add edge e to the current solution unless it (1) creates an undirected cycle or (2) creates a (partial) solution with higher cost than the current best. When the current partial solution is a full solution (detectable by testing whether number of edges equals number of vertices minus 1), compare its cost to the current best. There are four cases that need to be specified further. 1. If the current partial solution is a full solution and it has lower cost than the previous best, re- place the best solution and retrace decisions back to the last edge added in. Omit that edge and proceed to look for better solutions. 2. If the addition of e creates an undirected cycle, then omit e and proceed to add more edges. 3. If the addition of e creates a partial solution with higher cost than the current best full solution, then as in case 1, retrace back to the last edge added in. Omit that edge and proceed forward. 4. If the last edge was considered and the solution is not full, then as in case 1, backtrack to the last edge added in, omit it and proceed. All solutions have been considered when retracing returns all the way past edge 1. AGARWALA ET AL. 220 GENOME RESEARCH Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com The worst-case computation time of the algo- rithm is exponential in the number of edges in the input graph. The computation time is lowered sub- stantially by reducing the size of the graph to be considered in the branch and bound algorithm. The graph is reduced by three transformations that are applied repeatedly in any order until none of them can be applied: 1. Replace paths consisting of k + 1 vertices with one incoming edge and one outgoing edge with a single edge of cost k. This is valid because in the original graph, either all k edges must be in- cluded or excluded from a solution. 2. Delete unselected terminals with one incoming edge. This is valid because they cannot serve to connect the selected terminals. 3. Delete edge u → v if there is another path from u to v. This is valid because the input graph is an all shortest paths pedigree. Therefore, all paths from u to v have the same cost. The vertices on the multiple-edge path cannot lead to a worse solu- tion than using the single edge u → v and might lead to a better solution. The first transformation introduces edges whose weight is >1. The third transformation can apply only after the first transformation has created edges of weight >1 because the cost of all paths from u to v is the same. The cost of a solution in the transformed graph is the sum of costs of all edges in the solution. The transformation is reversed to pro- duce the output Steiner pedigree. SOFTWARE FOR CONSTRUCTING PEDIGREES GENOME RESEARCH 221 Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/ http://www.cshlpress.com 10.1101/gr.8.3.211Access the most recent version at doi: 1998 8: 211-221 Genome Res. Richa Agarwala, Leslie G. Biesecker, Katherine A. Hopkins, et al. County Lancaster Genealogies and an Application to the Old Order Amish of Software for Constructing and Verifying Pedigrees within Large License Service Email Alerting click here.top right corner of the article or Receive free email alerts when new articles cite this article - sign up in the box at the https://genome.cshlp.org/subscriptions go to: Genome Research To subscribe to Cold Spring Harbor Laboratory Press Cold Spring Harbor Laboratory Press on April 5, 2021 - Published by genome.cshlp.orgDownloaded from http://genome.cshlp.org/lookup/doi/10.1101/gr.8.3.211 http://genome.cshlp.org/cgi/alerts/ctalert?alertType=citedby&addAlert=cited_by&saveAlert=no&cited_by_criteria_resid=protocols;10.1101/gr.8.3.211&return_type=article&return_url=http://genome.cshlp.org/content/10.1101/gr.8.3.211.full.pdf https://genome.cshlp.org/subscriptions http://genome.cshlp.org/ http://www.cshlpress.com