David O. Nelson
Evaluating genotype-phenotype relationships in complex data sets with missing data
****************************************************************

One of the challenges of the post-genomic era will be to apply the vast quantities of emerging human sequence data to public health problems. Cheap re-sequencing technologies, coupled with emerging rapid-genotyping technologies, will drive the development of methods to predict the susceptibility of population strata to outcomes such as cancer incidence or survival in response to environmental exposures of interest.

LLNL is developing and analyzing a data set that attempts to enumerate the variation in DNA repair genes in a collection of human cell lines and to associate each cell line with a phenotype that provides an integrated, end-to-end measure of reduced repair capacity in response to ionizing radiation. This data set is being used to develop a simple scoring system that will associate genotypes with diminished DNA repair capacity and, in the future, with increased risk for clinical outcomes of interest. The data set is characterized by high dimensionality, ordinal predictors, and multiple outcome measurements. One of its more interesting features is its incompleteness: due to the vagaries of current genotyping methods, very few cell lines contain complete information on all genotypes.

In this talk, we will report on progress in developing algorithms that search for simple functions of combinations of interesting genotypes that predict reduction in DNA repair capacity. These algorithms exploit R on high-performance multiprocessors. They evaluate potentially interesting combinations of genotypes by resampling methods such as those developed by Breiman, Fridlyand, and others, and develop scoring functions by heuristic search methods similar to Logic Regression.
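
The general idea of evaluating one candidate genotype combination by resampling, while tolerating missing genotypes, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names R, whereas this sketch uses Python with only the standard library, and it swaps in a plain bootstrap of a Pearson correlation rather than any specific method of Breiman or Fridlyand. The gene names, the toy cell lines, the additive score, and the function names (combo_score, pearson, bootstrap_association) are all hypothetical. Missing genotypes are handled here by the simplest possible assumption: a cell line is used for a given combination only if every gene in that combination was genotyped.

```python
import random
from statistics import mean


def combo_score(genotypes, genes):
    """Additive score: sum of ordinal genotype codes (0, 1, 2) over `genes`.

    Returns None if any gene in the combination is missing (None or absent),
    so the cell line is skipped for this combination only.
    """
    values = [genotypes.get(g) for g in genes]
    if any(v is None for v in values):
        return None
    return sum(values)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a degenerate (constant) sample."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def bootstrap_association(cell_lines, genes, n_boot=200, seed=0):
    """Bootstrap the score/phenotype correlation for one gene combination."""
    rng = random.Random(seed)
    pairs = [(combo_score(c["genotypes"], genes), c["phenotype"])
             for c in cell_lines]
    usable = [(s, p) for s, p in pairs if s is not None]
    if len(usable) < 3:
        return None  # too few complete cell lines for this combination
    boot = []
    for _ in range(n_boot):
        # Resample the usable cell lines with replacement.
        sample = [rng.choice(usable) for _ in usable]
        xs, ys = zip(*sample)
        boot.append(pearson(xs, ys))
    return {"n_used": len(usable), "mean_r": mean(boot), "boot": boot}


# Hypothetical toy data: phenotype stands in for a repair-capacity measure.
cell_lines = [
    {"genotypes": {"geneA": 0, "geneB": 0, "geneC": None}, "phenotype": 1.00},
    {"genotypes": {"geneA": 1, "geneB": 0, "geneC": 1}, "phenotype": 0.85},
    {"genotypes": {"geneA": 1, "geneB": 1, "geneC": 0}, "phenotype": 0.70},
    {"genotypes": {"geneA": 2, "geneB": None, "geneC": 2}, "phenotype": 0.60},
    {"genotypes": {"geneA": 2, "geneB": 1, "geneC": 1}, "phenotype": 0.55},
    {"genotypes": {"geneA": 2, "geneB": 2, "geneC": None}, "phenotype": 0.30},
]

result = bootstrap_association(cell_lines, ["geneA", "geneB"])
```

In this sketch the combination (geneA, geneB) uses five of the six toy cell lines, because one line lacks a geneB call. A heuristic search over combinations, as the abstract describes, would call such an evaluation repeatedly for different candidate gene sets; that outer search is not shown here.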