Dr. Yan Wong, University of Oxford
Dr. James Prendergast, Roslin Institute, University of Edinburgh
Dr. Máire Ní Leathlobhair, Trinity College Dublin
Dr. Nicholas McGranahan, CRUK-UCL
Dr. Nabil-Fareed Alikhan, Quadram Institute
Professor Carole Goble, The University of Manchester
Professor Sandrine Dudoit, University of California, Berkeley
Often, scientific advances are made when the assumptions of underlying models are found to be inadequate. I will focus on an example from evolutionary genomics, where the assumptions underlying two foundational concepts (the evolutionary “tree”, and the interbreeding “population”) have been undermined by the advent of whole genome analysis. I will present a brief history of these concepts and their role in the development of biological thought, before describing how their inadequacies can start to be addressed by a recent genealogical approach, primarily devised to enable efficient computer simulation. This approach, implemented in our “tree sequence” software toolkit, promises to bring together the study of evolution on different timescales. Focussing on the structure needed to capture results from different statistical and computational models has forced us to examine what these genealogies actually represent. I will argue that our genealogies describe the basic biological processes of mitosis and meiosis, and are therefore less abstract than previous descriptions of the evolutionary process, although several improvements in our approach remain to be made.
Dr. Yan Wong, University of Oxford
Yan is an evolutionary geneticist with an interest in a wide range of biological problems. After a DPhil in Plant Sciences at Oxford, he collaborated with Richard Dawkins to write The Ancestor’s Tale, a comprehensive history of life in reverse time. This was followed by a period as a lecturer at the University of Leeds, then as a TV and radio presenter, most notably on the BBC One show Bang Goes The Theory.
Yan currently works at Oxford University’s “Big Data Institute”
Many studies have focused on identifying disease risk factors. However, whether these risk factors causally affect the disease, and if so, whether their effects are direct or mediated by other factors, is often poorly understood. Knowing the causal risk factors and the directionality of their relationships is crucial for understanding disease mechanisms and identifying “at risk” groups. Mendelian randomization (MR) is a popular method that uses genetic variants as instruments to investigate causal relationships between risk factors and disease outcomes in observational studies. It mimics randomized controlled trials by assuming that each individual’s genes were inherited randomly from their parents. MR sidesteps the issues of confounding (both observed and unobserved) and reverse causation. There has been a large growth in the application of MR approaches to health research. Most MR methods require parametric models. As the sample size and/or the number of variables increases, these methods are likely to become computationally intractable. Network analysis, mainly using machine learning algorithms (e.g. Bayesian networks), allows many risk factors and a disease outcome to be included simultaneously in a single model, with the aim of identifying direct and indirect effects of the risk factors on the outcome. An advantage of this approach is that it does not require a pre-specified parametric model or impose restrictive assumptions. Hence, it can potentially reduce bias compared to mis-specified complex models. Network analysis is an effective way of exploring data structure by discarding redundant variables (e.g., with the PC algorithm) after testing for marginal and conditional (in)dependence between each pair of variables. However, the relationships these networks depict are often associations/correlations, and thus do not necessarily have a causal interpretation.
In this talk, I will introduce MR and network analysis, followed by a discussion of how one can take forward the strengths of both MR and machine learning for causal network analysis.
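To make the MR idea concrete, here is a minimal sketch of two standard summary-statistic estimators: the per-variant Wald ratio and the fixed-effect inverse-variance weighted (IVW) estimate. It is an illustration only and not necessarily the method discussed in the talk; the function names and the toy effect sizes are hypothetical, and the sketch assumes independent variants that are valid instruments.

```python
from math import sqrt


def wald_ratios(beta_exposure, beta_outcome):
    """Per-variant causal estimates: variant-outcome effect / variant-exposure effect."""
    return [by / bx for bx, by in zip(beta_exposure, beta_outcome)]


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and its standard error.

    Equivalent to a weighted regression through the origin of the outcome
    effects on the exposure effects, with weights 1 / se_outcome**2.
    """
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx ** 2 for w, bx in zip(weights, beta_exposure))
    return numerator / denominator, sqrt(1.0 / denominator)


# Toy summary statistics (hypothetical values, not taken from any study).
beta_exposure = [0.10, 0.20, 0.40]   # variant effects on the risk factor
beta_outcome = [0.05, 0.10, 0.20]    # variant effects on the disease outcome
se_outcome = [0.01, 0.02, 0.02]      # standard errors of the outcome effects

estimate, std_error = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
print(f"IVW causal estimate: {estimate:.3f} (SE {std_error:.3f})")
```

Each Wald ratio is a causal estimate from a single instrument; the IVW estimate pools them, giving more weight to variants whose outcome effects are measured more precisely.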
Acute leukaemia is characterised by the disruption of haematopoietic differentiation due to genetic alterations in factors that control developmental processes. For example, key epigenetic regulators like Polycomb Repressive Complex 2 (PRC2) are frequently mutated or deleted in T-cell acute lymphoblastic leukaemia (T-ALL) and acute myeloid leukaemia (AML), leading to decreases in repressive H3K27 methylation across the genome. While haploinsufficiency of PRC2 core components EZH2, SUZ12 and EED correlates with poor outcome in childhood leukaemia, the precise molecular mechanism underpinning aggressive disease biology is poorly understood. In addition to its canonical function, PRC2 has been reported to directly alter chromatin conformation and enhancer-promoter contacts at topologically associating domains (TADs) in other cellular contexts, but a similar role in leukaemia has not been explored. Here, to understand the consequences of PRC2 loss-of-function in AML, we took a systematic approach and analysed transcriptomics, proteomics, CRISPR-screen data and drug screen results in 26 AML cell lines from the DepMap resource in the context of PRC2 alteration. Of note, mutation/deletion of any PRC2 core component typically resulted in reduced protein abundance of EZH2, EED and SUZ12. We uncovered specific drug sensitivities in PRC2-lost cell lines, such as higher sensitivity to JAK2 inhibitors. Additionally, we detected SLC9B1 as a novel dependency in PRC2-depleted cancer cells. PRC2-deficient cells were also found to be less dependent on SP1, a cancer-associated transcription factor which regulates cell growth, apoptosis and other cellular processes. To gain further insights into the mechanism of altered gene expression in AML, we developed an in vitro model of PRC2 loss via CRISPR-Cas9 editing of the PRC2 core component EZH2 in the OCI-AML2 cell line. We performed chromosome conformation capture (Hi-C) and RNA-seq on WT OCI-AML2 and EZH2-deficient OCI-AML2.
This allowed us to understand the dual role of PRC2 in gene regulation: as an epigenetic regulator via H3K27 methylation and its atypical role in chromatin conformation. Early analysis of our Hi-C data has revealed differential TADs between WT and EZH2-depleted OCI-AML2, suggesting that PRC2 deficiency leads to altered 3D genome architecture and formation of aberrant enhancer-promoter contacts. We are currently linking these findings to transcriptional outputs to understand whether these changes may underpin key oncogenic transcriptional drivers of this disease.
To a genomics researcher a cow and a human are more similar than different. Both species have genomes of similar size and gene number, can be afflicted by similar diseases and traits, and share highly conserved biological processes, from gene regulation to body development. These similarities make humans a good model organism for the cow and mean that techniques and approaches developed for humans can rapidly be applied to cattle, while findings from cattle research can potentially be used to improve human health. In this talk I will discuss some of the research we are undertaking to bridge the gap between human and cattle research. I will illustrate how we are applying approaches developed for humans, such as graph genomes, to livestock research; how functional variants conserved across both species can potentially be used to improve both human and cattle health; and how we are trying to understand the conservation across species of fundamental processes common to both, such as the regulation of DNA mutations.
Dr. James Prendergast, Roslin Institute, University of Edinburgh
James completed his PhD in bioinformatics and statistical genetics at the University of Edinburgh in 2007. Following positions at the European Bioinformatics Institute and University College Dublin, he returned to Edinburgh to work first at the MRC Human Genetics Unit before joining the Roslin Institute in 2013. James’s group is focused on understanding mammalian gene regulation, genome evolution, and human and animal disease genetics.
Hydractinia is a highly regenerative animal that shows no signs of age-related deterioration, develops no spontaneous neoplasia, and is highly resistant to ionizing irradiation (IR). These features are thought to depend on a population of adult pluripotent stem cells called i-cells and may indicate the presence of a highly stable genome in some or all Hydractinia cell types. Our work shows that Hydractinia possesses no unique protection against IR-induced DNA double-strand breaks (DSBs), and that cells clear γH2A.X foci within 24 hours, in line with other organisms. However, in contrast to other animals, Hydractinia stem cells are not sensitive to IR. Following irradiation, cycling i-cells exit the cell cycle for up to 9 days. The animals continue growing, which appears to depend on i-cell migration and differentiation only. We hypothesise that i-cells possess a novel mechanism for the maintenance of a stable genome. To address this hypothesis, acetic acid methanol (ACME) whole animal maceration followed by split pool ligation barcoding (SPLiT) was used to generate single-cell RNA-seq libraries from irradiated feeding polyps at 1 and 9 days post exposure to 50 Gy gamma irradiation. Using these libraries, we will investigate cell type-specific accumulation of mutations and differential gene expression at single-cell resolution. Understanding these mechanisms could provide insight into cnidarians' overall resilience to age-related degeneration and cancer. García-Castro, H., et al. (2021). ACME dissociation: a versatile cell fixation-dissociation method for single-cell transcriptomics. Genome Biology.
Pushing back the onset of age-related degeneration, effectively extending the human ‘healthspan’, requires a strong foundation of knowledge on the mechanisms of aging. Comparative studies of long-lived species are a powerful means to identify the genes and loci associated with the fundamental mechanisms of aging and aging avoidance. Bats are an excellent model, with 18 species being mammalian outliers in aging, displaying longevity much higher than expected from their body mass. We therefore assembled an extensive collection of 210 transcriptomes from 114 individual bats (Myotis myotis, max lifespan 37 years), including longitudinal sampling, to identify and characterise the transcriptomic and regulatory signatures associated with ageing in this remarkable species. Here we specifically focus on the dynamic expression and regulation of long non-coding RNA (lncRNA) loci, a class recognised as being involved in the regulation of gene expression. We identified 14,392 novel lncRNAs in the M. myotis genome, including 13,582 expressed in at least 2 individuals or at 2 time points across individuals. These lncRNAs were investigated in the transcriptomes of longitudinally sampled bats (3-7 timepoints), with a clustering analysis to identify lncRNAs with temporal expression patterns consistent between individuals. These consistent lncRNA loci will be primary candidates for further analysis, such as of their correlation with protein-coding gene expression. Brief summary of work and key results: By analyzing 210 transcriptomes from 114 individual bats, this research has identified and annotated long non-coding RNAs, assessed their expression and found lncRNA genes which display consistent patterns of expression across individuals as they age. By elucidating the relationship of these consistent lncRNAs to coding gene expression, their influence on aging in bats will be investigated.
Bovine TB (bTB), caused by infection with Mycobacterium bovis, is a major endemic disease affecting global cattle production. Its human counterpart, Mycobacterium tuberculosis, is one of the leading causes of death due to a single infectious agent worldwide. The two mycobacterial species share 99.5% genome sequence identity. The key innate immune cell that first encounters the pathogen in both humans and cattle is the alveolar macrophage, previously shown to be substantially reprogrammed during intracellular infection by the pathogen. In this study, we compare the host transcriptional responses to both mycobacterial pathogens by analysing existing RNA-seq data extracted from four infected cell groups: 1) bovine alveolar macrophages (bAM) infected with M. bovis (bAM-MB), 2) bAM infected with M. tuberculosis (bAM-MT), 3) human alveolar macrophages (hAM) infected with M. tuberculosis (hAM-MT), and 4) human monocyte-derived macrophages (hMDM) infected with M. tuberculosis (MDM-MT). These RNA-seq data were re-analysed using four different computational genomics analysis pipelines: 1) standard differential expression of genes (DEG), 2) differential expression interaction networks (DEN), 3) combined pathway analysis (CPA), and 4) Ingenuity Pathway Analysis (IPA). To identify common and distinct genomic variants in the bovine and human genomes significantly associated with bTB disease resistance and human TB (hTB) disease resilience, the results of the four analytical pipelines were integrated with two published GWAS datasets: 1) a bTB resistance GWAS consisting of high-density genotypes for 7,346 bulls and epidemiological data from 781,270 cattle, and 2) an hTB case-control GWAS consisting of 2,219 infected individuals and 450,045 non-infected controls.
Using a combination of multi-omics analyses, integration of GWAS data and cross-species comparison, we prioritised a panel of 12 cattle genes containing, or in proximity to, 224 intronic and exonic SNPs significantly associated with bTB disease resistance. We also identified 20 human genes containing, or in proximity to, 106 SNPs significantly associated with hTB disease resilience. Analysis of these 32 human and bovine gene loci revealed that, for both species, genomic variants associated with disease resistance/resilience are located within genes that are core to granuloma formation, the NF-κB signalling pathway and cytokine receptor interactions. Overall, these findings highlight the marked commonality of the bovine and human host responses to tuberculosis disease and emphasise the importance of the bovine model for understanding human tuberculosis.
In recent years, we have seen an information revolution in cancer research. However, the application of new insights gained from cancer genome sequencing in the clinical setting has been rather limited. Genomic studies continue to generate vast amounts of data, but without enough ways to translate this knowledge of cancer genetics into effective ways of treating cancer patients. In this talk, I will briefly describe what can be learned from whole cancer genome profiling and how national public genomics programmes can help us move from genome to clinic. The large-scale organisation of genomic and clinical data permits the application of machine learning methods that could be used to develop clinical algorithms to help clinicians and genomicists analyse and interpret individual cancer genomes. During my talk, I will refer to research from my own group on testicular cancer as part of the 100,000 Genomes Project in England and in partnership with the NHS. I will also discuss how we can advance the infrastructure and educational frameworks in Ireland to support a national public genomics programme and to prepare clinical genomicists for a future in which whole genome sequencing is routine practice. Finally, I will refer to the involvement of participants in research, how to offer personalised treatment to people with cancer, the need for diversity in genomic knowledge, and the integration of new technologies such as 'long-read sequencing'.
Dr. Máire Ní Leathlobhair, Trinity College Dublin
Máire Ní Leathlobhair originally graduated from Trinity College Dublin with a BA in Mathematics. She completed her PhD in Biological Sciences under the supervision of Elizabeth Murchison at the University of Cambridge in 2018. Following three years as a Junior Research Fellow at the University of Oxford, Máire recently started her research group at the School of Genetics and Microbiology in Trinity College Dublin where she is HCI Assistant Professor in Biological Data Analytics. Currently, her main research interests lie in investigating the development of rare cancers.
Irish:
Le daichead bliain anuas, ba phríomhfhanacht é i dtaighde leighis bithmharcóirí a aimsiú ar féidir tairngreacht, fáthmheas agus tuaras a dhéanamh ar galar de gach uile short. In ailse próstatach, tá PSA in úsáid ó na 1990idí chun luathbhrath, atarlú agus a dtuar an galar. De réir mar a laghdaigh costas seicheamhú géiniteach le himeacht ama, tá méadú breise tagtha ar an féidearthacht a bhaineann le na bithmharcóirí. Toisc go mbíonn tionchar ag ailse phróstatach ar dhuine as seachtar fear in Éirinn agus go mbíonn tionchar ag atarlú suas le 40% in ainneoin idirghabháil mháinliachta, tá tábhacht ar leith ag baint le bithmharcóirí nua a aimsiú. Úsáidtear nomagraim thuartha faoi láthair chun cuidiú leis an atarlú a thuar ach is gnáthathróga cliniciúla amháin a mheastar. De bharr toise ard na sonraí géiniteacha, ní féidir staidrimh traidisiúnta amháin a úsáid don samhaltú agus teastaíonn scagadh, rud a d’fhéadfadh faisnéis ríthábhachtach a chailleadh. Níl an fadhb seo ag baint le teicnící meaisínfhoghlama. Chorpraithe sa samhla tá an cumas acu feabhas a chur ar an réamh-mheastacháin agus is é seo príomhfhócas an staidéar seo. Déanann an staidéar a cuireadh i láthair imscrúdú ar úsáid mRNA chun atarlú ailse próstatach a thuar iar-obráide le modhanna meaisínfhoghlama.
English:
For the past forty years, finding biomarkers which can be used in the prediction, diagnosis and prognosis of many diseases has been a mainstay of healthcare research. In prostate cancer, prostate specific antigen (PSA) has been used since the 1990s for early detection, recurrence and prediction of the disease. As the cost of genetic sequencing has fallen over time due to improvements in technology, the feasibility of using these biomarkers routinely has further increased. Prostate cancer affects one in seven men in Ireland, and is one of the most common cancers in men globally. With recurrence affecting up to 40% of patients despite primary surgical intervention, finding new biomarkers is of particular importance in this field. Predictive nomograms are currently used to aid in the prediction of recurrence, but the state-of-the-art tools only include routine clinical variables. To improve on this, genetic variables could be included. However, due to the high dimensionality of genetic data, traditional statistical approaches for prediction cannot be used for the modelling alone. To implement these common techniques, pre-filtering and feature selection are required, which may result in crucial information being lost. Machine learning techniques, by their nature, do not have these issues and, when incorporated in the model design, have the potential to improve predictive ability, which is the main focus of this investigation. The study presented investigates the use of messenger RNA (mRNA) in the prediction of prostate cancer recurrence post-operatively using machine learning models, with the aim of improving on the current state-of-the-art.
Irish:
Tá an córas imdhíonach níos mó ar intinn an phobail le roinnt blianta anuas de bharr na paindéime. Ach tá gnéithe éagsúla den chóras a théann i ngleic le pataiginí agus ní mar a chéile iad ó thaobh an chosaint a thugann siad ó ghalair ná an cumas a bhíonn ag an bpataigin éalú uathu. Le blianta beaga anuas tá níos mó airde á dhíriú ar an gcóras imdhíonach mar bhealach le dul i ngleic le hailse chomh maith agus tá torthaí an-dearfacha feicthe ar dhrugaí agus ar theicneolaíochtaí a luíonn leis an bprionsabal go bhfeidhmíonn an córas imdhíonach mar líne cosanta rí-thábhachtach in aghaidh ailse. Mar sin féin ní oibríonn na cóir leighis seo do gach othar ná do gach cineál ailse agus tá diantaighde á dhéanamh le tuiscint níos fearr a fáil ar na fáthanna atá leis seo. Sa chur i láthair seo, tabharfaidh mé faoi chur síos a dhéanamh ar phríomhghnéithe an chórais imdhíonaigh le béim ar an gcumas atá ag an gcóras idirdhealú a dhéanamh idir pataiginí agus cealla sláintiúla an choirp agus na bealaí a n-éalaítear ó na freagairtí céanna. Déanfaidh mé tagairt d'obair i mo ghrúpa taighde féin a cheistíonn an ceangal idir an córas imdhíonach agus na cineálacha sócháin a fheictear in ailse. Cúpla bliain ó shin tháinig roinnt foilseacháin shuntasacha ar an bhfód a mhaígh go mbraitheann na sócháin a fheictear in othair le hailse ar ghéinitíopa an chórais imdhíonaigh. Thaispeánamar nach bhfuil ceangal láidir eatarthu, áfach, rud a fhágann ceisteanna móra faoin ról atá ag an gcóras imdhíonach mar shraith cosanta in aghaidh ailse.
English:
The pandemic has probably made most of us a little more aware of our immune system, its role in protecting us from disease and the capacity of viruses to escape from it. However, the immune system is multifaceted and its distinct arms can have different implications in terms both of the protection they offer from disease and the capacity of pathogens to evolve mechanisms of immune evasion. Even before the pandemic the human immune system was receiving increased attention from researchers as a consequence of its potential for the treatment of cancer, with immune-based therapies showing spectacular successes against several important cancer types. In a sense, these novel treatments are a natural extension of the theory of cancer immunosurveillance, which proposed that the immune system plays a major role in preventing the development of cancer in the first place. In this talk I will give a general overview of the key components of the human immune system, with an emphasis on the mechanisms that distinguish pathogens and defective cells from healthy cells and the ways in which these mechanisms are subverted. I will discuss some recent research suggesting that the mutations that cause cancer are influenced by the genotype of genes that control how the immune system functions, and touch on some work in my own research group that has questioned this link between immune genotype and cancer driver mutations. Ultimately, the extent to which the mutations that drive the development of cancer are shaped by our immune systems remains an important open question and the subject of some very interesting ongoing research.
Proteins are the molecules that carry out most of the biological processes in the cell. Genes act as templates for proteins, and this information is passed from cell to cell and from generation to generation. A gene must be transcribed into a transcript, and then translated into a protein, before it can function in the cell. This is called gene expression. Because it requires a lot of energy and many different molecules, this process is tightly regulated by the cell. This regulation often fails in disease, and proteins are produced that carry out various processes at the wrong time; in cancer, for example, processes related to cell proliferation are constantly active. So if researchers understand which genes are differentially expressed in disease, they will be able to identify the processes that are out of order and develop treatments based on this knowledge. In many cases hundreds of genes show significant changes in expression between normal and disease samples, too many to examine each gene in detail. Rather than testing genes individually, we can group them together based on their function, biology or chemistry, and run the test on the group of genes. This is called gene set analysis. It has several advantages: it reduces the number of tests to be run, it incorporates biological knowledge into the analysis, and it helps to make sense of the results. We are proposing a new method that avoids testing individual genes entirely. In addition, we are able to take into account the uncertainty in our estimate of the change in expression. This approach improves our ability to identify significant gene sets.
Although great progress has recently been made in sequencing whole genomes quickly and easily, the volume of data generated poses a significant interpretive challenge for clinicians and clinical scientists. Every individual carries numerous variants that cause no disease, and many more again of no clear clinical significance. On top of that, there are persistent ethical questions about reporting variants that are found to cause disease but are unrelated to the patient's symptoms, or variants found in children that only have an effect in adults. To overcome these problems, it is now common practice for diagnostic genetics laboratories to sequence the clinical exome; that is, the coding regions of the genes with a known link to genetic diseases. To reduce the analysis workload further, variants are sought only in a subset of genes, known as a virtual panel, which is predetermined according to the patient's symptoms. The advantage of using virtual panels is that their content is flexible, and further genes can be added if a link is found between them and a particular disease. It is then far easier to reanalyse the sequence of an undiagnosed patient than to request a further sample from the patient and resequence their DNA, a process that would create extra cost and work for the laboratory, as well as a significant delay in reaching a diagnosis. The four laboratories that make up the Scottish Genetics Laboratory Consortium began offering clinical exome sequencing to patients in 2019, with testing for particular diseases divided among them according to the expertise available in each laboratory. Before that, the Consortium had to request this testing from laboratories in England, which added greatly to the cost of the service.
Since then, the service has succeeded in reaching a diagnosis in many complex cases, which will be discussed in this talk, and has done so cost-effectively, helping to provide counselling and support to patients and their families without much delay.
Irish:
Tá Graif Eolais (GEanna) tar éis éirí an-choitianta i chomhthéacs bitheolaíochta agus bith leighis de bharr a gcumas eolas stóráil agus a nascadh le chéile go nádúrtha idir saghsanna eolais ar leith. Ach is fadhb é ag an am céanna nascadh forleathan na nGEanna: is rud é a chuireann isteach ar iarratais daoine a n-eolas a léamh agus a thuiscint gan chúnamh ó uirlisí ríomhaireachta. Rogha amháin chun déileáil leis an méid eolais sin ná meaisínfhoghlaim, samhlacha Leabuithe Graf Eolais (LGEanna) go háirithe. Bíonn siad in ann ní hamháin an t-eolas a chur ar fáil ar fhoirm níos éasca do thaighdeoirí, ach tá siad in ann agus fíricí nua a réamhinsint ón mbun-ghraf. Bíonn deacrachtaí fiú leis an gcur chuige sin, áfach: ní thuigtear cén baint atá ann idir struchtúr GEanna éagsúla agus feidhmiú chórais LGEanna. Móide sin, is beag an taighde atá déanta ar cé chaoi codanna éagsúla LGEanna a chur le chéile ná ar cé chaoi a hipearpharaiméadair a roghnú de réir sonraí an graif amháin seachas trí chuardach mór a dhéanamh. Sa taighde seo, déantar cur síos ar an idirghníomhú idir 1) struchtúr GEanna, 2) samhlacha LGE agus a hipearpharaiméadair, agus 3) feidhmíocht chórais atá bunaithe ar LGEanna. Ar anailísíocht a dhéanamh ar LGEanna i gcomhthéacs trí líonra idirghníomhaithe próitéin, faightear amach go bhfuil samhlacha agus hipearpharaiméadair iomlán ar leith de dhíth ar thacair eolais éagsúla, rud nach suntasach. Ach faightear amach chomh maith go bhfuil samhlacha agus hipearpharaiméadair ar leith de dhíth ar ghraif eolais a bhfuil an t-eolas ceannann céanna isteach iontu, ach a bhfuil struchtúir éagsúla acu. Thairis sin, léiríonn torthaí an staidéir seo go bhfuil an-bhaint ag lárnacht nód i ngraf eolais ar cé chomh maith is a bhíonn feidhmíocht córais LGEanna. Faightear feidhmíocht i bhfad níos láidre nuair atá lárnacht níos mó ag nóid an ghraif. Mar aon leis sin, bíonn feidhmiú i bhfad níos measa ag córais LGEanna i gcomhthéacs graif a bhfuil nóid le lárnacht bheag acu. 
Ar deireadh, déantar conclúid gur ceart radharc bunaithe ar struchtúr an eolas a úsáid chun úsáid LGEanna a éascú agus chun iad a dhéanamh níos éifeachtaí.
English:
Knowledge Graphs (KGs) have recently become very popular in the context of biology and biomedicine because of their natural ability to store and link various types of data. But at the same time, this presents a problem for biological KGs: it makes it harder for humans to read the data they contain without the help of computational tools. One option for dealing with this is machine learning, in particular Knowledge Graph Embeddings (KGEs). They are able not only to present the data in a form that is easier for researchers to read, but also to predict new facts from the base graph. But even that approach has issues: there is a lack of understanding regarding the interactions between KG structure and KGE performance. On top of that, little research has been done on how to assemble the various pieces of KGE models or on how to determine hyperparameters from the base data rather than with a large search. This work describes the interaction between 1) KG structure, 2) KGE models and their hyperparameters, and 3) the performance of KGE-based systems. Through an analysis of KGEs on protein interaction networks, it is found that very different KGE models and hyperparameters are needed for different data, as expected. However, it is also found that very different models and hyperparameters are needed for KGs containing the exact same data, but with different structures. Furthermore, the results of this research show that node centrality has a strong influence on how well KGEs perform. Much higher performance is obtained when nodes have overall higher centrality. Conversely, performance is greatly reduced when more nodes in a graph have very low centrality. Finally, the work concludes that a graph-structure-based view of data could be effectively applied to determining how to use and optimise KGE models.
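As a small illustration of what node centrality means for a KG stored as (head, relation, tail) triples, the sketch below computes normalised degree centrality. The abstract does not say which centrality measure was used, so degree centrality is an assumption here, and the function name and toy triples are hypothetical.

```python
from collections import defaultdict


def degree_centrality(triples):
    """Normalised degree centrality for each node of a KG given as triples.

    Edges are treated as undirected and relation types are ignored; a node's
    score is its number of distinct neighbours divided by (n_nodes - 1).
    """
    neighbours = defaultdict(set)
    for head, _relation, tail in triples:
        # Register both endpoints so isolated self-loops still count as nodes.
        neighbours.setdefault(head, set())
        neighbours.setdefault(tail, set())
        if head != tail:
            neighbours[head].add(tail)
            neighbours[tail].add(head)
    n_nodes = len(neighbours)
    if n_nodes < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n_nodes - 1) for node, adj in neighbours.items()}


# Toy protein interaction graph (hypothetical identifiers).
toy_triples = [
    ("P1", "interacts_with", "P2"),
    ("P1", "interacts_with", "P3"),
    ("P1", "interacts_with", "P4"),
    ("P2", "interacts_with", "P3"),
]

for node, score in sorted(degree_centrality(toy_triples).items()):
    print(node, round(score, 2))
```

In this toy graph P1 is a hub connected to every other node, while P4 has a single neighbour; the abstract reports that graphs dominated by low-centrality nodes of the latter kind are the ones on which KGE performance suffers.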
Cancer development within an individual is an evolutionary process. This has important clinical implications for cancer prevention and therapy, as well as our understanding of cancer progression and metastatic spread. In this talk, I will outline how we can exploit cancer genomic sequencing data to decipher cancer evolutionary histories and the extent of diversity within individual tumours. I will focus on lung cancer, the leading cause of cancer-related deaths worldwide. I will evaluate how tumours spread from the primary tumour to distant sites, and when this occurs during a tumour’s development. Finally, I will explore how we can use novel bioinformatics tools to shed light on the interface between the cancer cell and the immune microenvironment, and mechanisms of immune escape. I will explore how DNA sequencing data can be harnessed to identify T cells in tumour samples, and the clinical relevance of T cell infiltrate in predicting response to immunotherapy.
Dr. Nicholas McGranahan, CRUK-UCL
Dr Nicholas McGranahan completed his undergraduate degree in Natural Sciences, specializing in Evolutionary Genetics, at the University of Bath before pursuing post-graduate studies at University College London at the Centre for Mathematics and Physics in the Life Sciences and Experimental Biology (CoMPLEX). In 2011, Dr McGranahan joined Professor Charles Swanton’s group at the CR-UK London Research Institute (now the Francis Crick Institute), completing a PhD in Cancer Genomics in 2015.
Nicholas established his own research group in 2018 as a Sir Henry Dale Fellow at the UCL-CRUK Lung Cancer Centre of Excellence, where his research interests include using bioinformatics to dissect cancer evolution. His team explore the evolutionary history of cancers through sequencing multiple regions of individual tumours. In particular, Dr McGranahan’s research has focused on understanding the importance of genome doubling in tumour evolution, exploring the mutational processes shaping the genomes of cancers over space and time, and investigating the interface between the cancer genome and the immune microenvironment.
Multiple myeloma (MM) and acute myeloid leukemia (AML) are blood cancers associated with high relapse rates. These cancer cells reside in the bone marrow (BM). As the disease evolves, it alters the microenvironment of the BM, which will eventually support the growth of malignant cells. To evade detection, cancer cells repress the immune response by activating the immune checkpoint receptors. Drugs targeting immune checkpoints have shown promise in tumor eradication, but a thorough understanding of how immune cells are affected in the microenvironment is needed to develop efficient therapies capable of reactivating immune recognition of the tumor. A scRNA-Seq approach across multiple studies would provide a high-resolution perspective of the various stages of the disease. The inclusion of more studies would allow us to draw a broad panorama of the BM microenvironment and to isolate specific cell subtypes. Finally, by focusing on the interaction of immune and cancer cells along with the progression of the disease, and by tracing the differentiation trajectories, we aim to identify new therapeutic candidates against these conditions. To this end, we developed a pipeline to process and integrate scRNA-Seq data from different studies; it retrieves raw data, computes counts, filters empty and double droplets, and performs quality control (Seurat). Afterward, cell subtypes are assigned using the package SingleR and the reference dataset from Granja et al. 2019. For MM we collected 42 samples from 4 publicly available studies, obtaining a dataset of >190,000 cells comprising all disease stages. After sorting the cells by disease stage, we have already tracked the cell type proportions present in each stage. The next steps will include differential gene expression analysis of the immune and cancer cells across the disease stages, differentiation trajectory inference, and cell-cell interaction analysis to highlight events that drive disease progression and relapse.
Cancer karyotypes are characterised by widespread copy number alterations and aneuploidy, which can cause stoichiometric imbalances between members of protein complexes. In the germline, stoichiometric imbalances affecting dosage-sensitive genes produce deleterious effects that restrict copy number variation and contribute to the lethality of nearly all germline aneuploidies. It is unclear how tumours tolerate such high levels of aneuploidy. Recent work suggests that tumours face pressure to maintain stoichiometric balance at the protein level and that the ability to buffer CNAs via dosage compensation is associated with higher tumour fitness (Senger et al, 2022). We sought to understand the extent to which CNAs are buffered in cancer by integrating copy number, transcriptomic, proteomic and metabolomic data from the CPTAC and CCLE databases. Using copy number-RNA-protein correlations, we derive a CNA attenuation score for each gene and investigate how CNA length and overall genomic ploidy affect dosage compensation. Aneuploidy is a known prognostic factor in cancer (Van Dijk et al, 2021), but we hypothesise that the effects of aneuploidy depend on the level of dosage compensation of the altered genes. We will use our gene-specific CNA attenuation scores to predict ‘effective aneuploidy’ for each patient tumour in The Cancer Genome Atlas and investigate whether effective aneuploidy is a better predictor of prognosis than aneuploidy alone.
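One plausible way to turn copy number-RNA-protein correlations into a per-gene score is sketched below. It assumes the score is the copy number-RNA correlation minus the copy number-protein correlation, so that a gene whose transcript follows copy number but whose protein does not scores high. The abstract does not give the exact definition, and all values are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def attenuation_score(copy_number, rna, protein):
    """Assumed definition: corr(CN, RNA) - corr(CN, protein) for one gene."""
    return pearson(copy_number, rna) - pearson(copy_number, protein)


# Hypothetical gene measured across five tumours.
copy_number = [1, 2, 2, 3, 4]
rna = [10, 20, 20, 30, 40]            # transcript tracks copy number
protein = [5.0, 5.2, 4.9, 5.1, 5.0]   # protein level is nearly flat

score = attenuation_score(copy_number, rna, protein)
print(round(score, 3))
```

Under this assumed definition, a score near zero would indicate that protein abundance follows copy number as closely as RNA does, while a score near one would indicate strong buffering at the protein level.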
The majority of somatic tumour mutations fall in DNA that does not encode proteins, yet it remains unknown if these non-coding mutations can act as “drivers” that promote cell fitness. Long non-coding RNAs (lncRNAs) are amongst the most highly mutated non-coding regions and have been widely linked to cancer. We previously developed ExInAtor, a software package that identifies driver genes via signatures of positive selection in somatic single nucleotide variants (SNVs). Here, we leveraged the unique collection of 16,000 tumour genomes from Genomics England to search for driver lncRNAs. Our analysis reveals evidence for 60 significantly mutated lncRNAs across 11 tumours (FDR<0.0001), a significant increase over previous collections. The resulting catalogues are significantly enriched for lncRNAs with known roles in cancer, such as NEAT1 and MALAT1. Putative driver lncRNAs are significantly associated with a range of clinical and genomic features that further support their disease relevance. Follow-up studies will employ genome-editing approaches to experimentally test the fitness effects of SNVs. These findings will provide an essential foundation for understanding the role of cancer-driver lncRNAs, and these non-coding mutation maps derived from Genomics England may offer novel strategies for personalised therapy.
Obesity is recognised as one of the primary causes of ill health globally as it contributes to diabetes and cardiovascular diseases, and is associated with 49% of cancers. High-energy western diets, inactivity associated with modern-day society, and genetic susceptibility are all contributing factors leading to the obesity pandemic. Several studies have shown the profound role immune cells play in the pathogenesis of obesity, as evidenced by immune cell dysfunction in tissues such as the liver and adipose. As well as causing chronic low-grade inflammation, obesity also negatively affects vaccine efficacy and leads to increased risk of complications from infection. Thus, effective therapeutic interventions for obesity and its associated co-morbidities are highly sought after. Weight loss is one such established therapeutic intervention for dealing with obesity and its complications. Modest weight loss (~10%) improves glycaemic control, reduces blood pressure, and reduces cholesterol levels; however, it remains unclear whether weight loss restores the immune cell function that is dysregulated in obesity. To analyse whether immune cell function is recovered following weight loss, we used a combination of scRNA-seq and metabolomics to analyse tissues obtained from a weight loss mouse model. Initial exploratory analysis reveals several key genes and metabolites that change following weight loss, providing insights into the mechanisms underlying immune cell dysfunction in obesity. We also analysed a human weight loss scRNA-seq cohort, allowing findings in the mouse model to be cross-checked in humans for similar changes. Collectively, these analyses may provide insights into key mechanisms of immune cell dysfunction in obesity and potential methods of therapeutically targeting such mechanisms to improve immune cell health in this cohort.
The anti-seizure drug vigabatrin (VGB) is an effective drug for controlling seizures, especially infantile spasms. However, use is limited by VGB-associated visual field loss (VAVFL). Approximately 33% of VGB-exposed adult patients experience this adverse reaction (Maguire, 2010), although the mechanisms by which VGB causes VAVFL remain unknown. Average peripapillary retinal nerve fiber layer (ppRNFL) thickness is correlated with the degree of visual field loss (measured by mean radial degrees) (Clayton, Dévilé et al. 2011). Duration of VGB exposure, maximum daily VGB dose, and male sex are associated with peripapillary retinal nerve fiber thinning. We hypothesize that common genetic variation is a predictor of VAVFL. Identifying pharmacogenomic predictors of VAVFL could potentially enable safe prescribing of VGB and broader use of a highly effective drug. Methods: Optical coherence tomography (OCT) data were produced on VGB-exposed individuals (n=99) from the EpiPGX consortium. We conducted quantitative GWAS analyses for the following OCT thickness measurements: 1) Average ppRNFL, 2) inferior quadrant, 3) nasal quadrant, 4) superior quadrant, 5) temporal quadrant, 6) inferior nasal sector, 7) nasal inferior sector, 8) superior nasal sector, 9) nasal superior sector. We included sex, cumulative dose, maximum daily dose, duration of prescription of VGB (years) and 4 principal components as covariates. Using the summary statistics from the GWAS analyses we conducted gene-based testing using VEGAS2. To determine if VGB-exposed patients were predisposed to having a thinner RNFL, we calculated their polygenic burden for retinal thickness. We conducted nine different PRS analyses using the OCT measurements. PRS alleles for retinal thickness were calculated using the summary statistics from a large-scale GWAS of inner retinal morphology using the OCT images of 31,434 UK Biobank participants (Currant, Hysi et al. 2021).
Results: The GWAS analyses did not identify a significant association after correction for multiple testing. Similarly, the gene-based and PRS analyses did not reveal a significant association that survived multiple testing. Conclusion: We set out to identify common genetic predictors for VGB-induced ppRNFL thinning (as a marker of VAVFL), under univariate and polygenic models. Results suggest that large-effect common genetic predictors are unlikely to exist for VGB-induced VAVFL. Sample size was a limitation: power calculations show that this study was underpowered. However, recruitment is a challenge as VGB is rarely used today because of this adverse reaction. Rare variants may be predictors of this ADR and were not studied here.
Intracranial Haemorrhage (ICH) and stroke are common causes of death among kidney donors, but the underlying genetic factors involved are poorly understood. Both ICH and stroke are highly heritable and polygenic. Polygenic burden for a trait can be calculated by generating a Polygenic Risk Score (PRS), which estimates the cumulative effect of common genetic variation on an individual’s disease status. Previous studies have found that PRSs for ischemic stroke offer predictive performance similar to clinical risk factors. Here we investigate the role of polygenic burden in determining donor cause of death and age of death as well as its impact on recipient graft function. Methods: We utilised 5,292 genotyped kidney transplant donor-recipient pairs from 4 different centres across Europe. We calculated PRSs for stroke, Intracranial Aneurysm (IA) and hypertension using large published GWASs of European ancestry. We compared PRSs between the donors who died of stroke (2,642 individuals) and controls, as well as donors who died of other causes and living donors. We investigated whether donor polygenic risk impacted on donor age and on graft survival and function. Results: Donors who died of stroke had significantly higher mean PRSs for hypertension, stroke, and IA than controls (p-values 3e-15, 4e-9, and 0.001 respectively). Hypertension PRS had a significant effect on donor age of death among the donors who died of other causes (Beta: 1.4, p: 2e-4). We also found significant effects of PRSs for hypertension and IA on graft survival as well as eGFR at 1 and 5 years post-transplant. Conclusion: These observations support the hypothesis that deceased donors carry an increased burden for traits related to stroke. Increased burden for these traits results in a greater risk of death from stroke as well as a younger age of death. These PRSs are similarly predictive to well-established clinical risk factors including hypertension, current smoking and male sex.
Kidneys that come from donors with high polygenic risk for all traits have significantly worse graft function and survival than low-risk kidneys. PRSs can distinguish donors from the general population. These findings could have utility in testing relatives of donors to determine if they share the same risk for stroke.
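As a minimal illustration of how a PRS accumulates the effect of common variants, the sketch below uses the standard weighted-sum form: each risk-allele dosage (0, 1 or 2) is multiplied by that variant's GWAS effect size and the products are summed. The abstract does not describe the scoring software or variant selection used, and the effect sizes and genotypes here are hypothetical.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum of risk-allele dosages weighted by per-variant effect sizes."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * beta for d, beta in zip(dosages, effect_sizes))


# Hypothetical effect sizes for three variants, taken from an imagined GWAS.
effect_sizes = [0.10, 0.05, 0.20]

# Hypothetical risk-allele counts for two donors at those variants.
donor_a = [2, 1, 0]
donor_b = [0, 0, 1]

score_a = polygenic_risk_score(donor_a, effect_sizes)
score_b = polygenic_risk_score(donor_b, effect_sizes)
print(round(score_a, 2), round(score_b, 2))
```

In practice scores are usually standardised against a reference population before groups are compared, a step omitted here for brevity.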
The vast majority of microbes are harmless to us, and many play essential roles in health. Others are pathogens and exert a spectrum of deleterious effects on their hosts. Until recently, infectious diseases represented the most common cause of death in humans, exceeding by far the toll taken by wars or famines. From the dawn of humanity and throughout history, infectious diseases have shaped human evolution, demography, migrations and history. In this talk, I will discuss some aspects of our long-standing relationship with microbial pathogens and how genomics can unravel that complicated history. I will talk about how genomics can be applied to the past, present and future. The past will focus on the enteric pathogen Salmonella enterica and genomes recovered from long-deceased hosts. For the present, I will talk about Salmonella enterica ser. Agona from European badgers. For the future, I will reflect on our experience of tracking the culprit of the COVID-19 pandemic, SARS-CoV-2, and the steps required to combat future pandemics.
Dr. Nabil-Fareed Alikhan , Quadram institute
Nabil-Fareed is currently a Bioinformatics Scientific Programmer with Andrew Page at the Quadram Institute Bioscience. He was previously a Senior Research Fellow with Mark Achtman at the University of Warwick. He completed his PhD in 2015, under the supervision of Scott Beatson at the University of Queensland. He graduated from UQ with a Bachelor of Information Technology and a Bachelor of Science in 2008.
Nabil-Fareed's recent contributions include analysis and data curation for the Quadram Institute as part of the COG-UK Consortium national effort for tracking COVID-19 (SARS-CoV-2) through genome sequencing. He also develops bioinformatics infrastructure within the Quadram Institute, moving towards open web platforms and cloud computing. He contributes to the infrastructure working group within The Public Health Alliance for Genomic Epidemiology (PHA4GE). Nabil-Fareed's research interests include the population genetics and pathogenesis of enteric pathogens, including Salmonella and E. coli.
Lager brewing first occurred in Bavaria in the 15th century, associated with restrictions of brewing to colder months. The lager yeast, Saccharomyces pastorianus, is cold tolerant. It is a hybrid between Saccharomyces cerevisiae and Saccharomyces eubayanus, and has been found only in industrial settings. Natural isolates of S. eubayanus were first discovered in Patagonia 11 years ago. They have since been isolated from China, Tibet, New Zealand and North America, but not from Europe. Here, we describe the first European strains, UCD646 and UCD650, isolated from a wooded area on a university campus in Dublin, Ireland. We generated complete chromosome-level assemblies of both genomes using long- and short-read sequencing. The UCD isolates belong to the Holarctic clade. Genome analysis shows that isolates similar to the Irish strains contributed to the S. eubayanus component of the S. pastorianus isolates, but isolates from Tibet share more similarities.
The development of ancient genomic techniques has allowed the exploration of both host genomes and the genomes of commensal and pathogenic microbes. Teeth provide a particularly rich substrate: as well as endogenous host DNA, they are also a source of DNA from both bloodborne pathogens and the oral microbiota. We carried out metagenomic screening of a dataset of 344 tooth and calculus samples from diverse prehistoric and Medieval contexts. This identified two Early Bronze Age teeth, samples from a limestone cave at Killuragh, Co. Limerick, which had exceptional preservation of oral pathobiont species. De novo genome assembly could be carried out for two species, Streptococcus mutans and Treponema denticola. S. mutans is a key causative agent of dental caries, due to its ability to metabolise carbohydrates, form biofilms and survive acidic environments. The Killuragh genome is the first reported ancient S. mutans genome: here we present our new genome and preliminary analysis of phylogenetic placement and virulence factors. Our results show this 4,000 year old genome is nested within modern S. mutans diversity, demonstrating the ancient association between host and pathogen. Furthermore, while 93.7% of modern genomes harbour at least one mutacin, proteins important for the colonisation and establishment of S. mutans in the dental biofilm, the Killuragh genome is negative for all tested mutacins.
The FAIR (Findable, Accessible, Interoperable, Reusable) principles laid a foundation for sharing and publishing scientific data, now extending to all digital objects including software with the recent publishing of the FAIR principles for Research Software (FAIR4RS). One kind of software widely used in Biosciences is computational workflow systems, whose adoption has accelerated in the past few years, driven by the need for repetitive and scalable data processing, access to and exchange of processing know-how, and the desire for more reproducible (or at least transparent) and quality-assured processing methods. Over 320 workflow systems are currently available, although a much smaller number are widely adopted. Since workflows are first-class, publishable research objects, it seems natural to apply FAIR principles to them. But what does the FAIRness of workflows mean and why does it matter? ELIXIR, the European Research Infrastructure for Life Science Data, the European EOSC-Life Workflow Collaboratory, Australia BioCommons and the Workflows Community Initiative, alongside big workflow players like Galaxy, Snakemake and Nextflow, are developing an ecosystem of tools, guidelines and best practices to make bioscience workflows FAIR. In this talk I will shine a light on this work, ranging from the bigger picture of strategic directions to the practicality of daily work in the lab.
Professor Carole Goble , The University of Manchester
Carole is a Full Professor of Computer Science at the University of Manchester, UK, where she leads the e-Science Group of Researchers, Research Software Engineers and Data Stewards. She has 30+ years’ experience of research in reproducible science, open data and method sharing, knowledge and metadata management and computational workflows in a range of disciplines, notably the Life Sciences. She has developed production services for workflows, web services, and data management and co-led digital infrastructure projects and resources.
Carole is the Joint Head of Node of ELIXIR-UK, the national node of the ELIXIR European Research infrastructure for Life Science data, and leads the digital infrastructure for IBISBA, the EU Research Infrastructure for Industrial Biotechnology.
At the national level she serves on the leadership team of Health Data Research-UK and is a founder of the UK’s Software Sustainability Institute. Carole serves on the Board of Directors of Sage Bionetworks, the SAB of Helmholtz Metadata Collaboration, and 10+ other centres, and is the UK representative on the G7 Open Science Working Group. She has long been active in computational workflows: developing workflow platforms (Taverna), public resources for sharing workflows (myExperiment, WorkflowHub) and metadata frameworks for reproducible workflows (RO-Crate, Bioschemas, CWL). She is a pioneer of Open and FAIR data and software in scholarly communication and an author of the original Nature FAIR Scientific Data Principles article. These two threads come together as FAIR Computational Workflows, which she leads for the Workflows Community Initiative.
Epigenetic clocks are statistical models that estimate aging using DNA methylation data from selected CpG sites across the genome. These estimates, known as epigenetic age, are strongly correlated with chronological age, but departures from these predictions, known as epigenetic age acceleration (EAA), are associated with various exposures or health outcomes. Most studies of epigenetic age are cross-sectional with a single epigenetic age measurement. However, the recent growth in cohorts with longitudinal DNA methylation data has increased the number of research studies on epigenetic age over time. In this study, we explore methods for modelling epigenetic age over time using a review of the literature and simulation study of commonly used approaches. A review of the latest literature discovered major discrepancies in methods to model longitudinal epigenetic age data. Methods differed in their (i) outcome (e.g., EAA or epigenetic age), (ii) time variable (e.g., chronological age or categorical time at measurement), and (iii) statistical model to capture change over time. While most studies used epigenetic age as the longitudinal outcome of interest, others include EAA instead. We identified two equally common ways to account for time in a model – either to include chronological age or categorical time. We identified three frequently used methods to measure change over time: (a) linear mixed effect models (LMM), (b) delta aging, which quantifies the difference between follow-up and baseline measurements, and (c) generalized estimating equations (GEE). The most common approach was to use LMM with a random intercept term, but rarely a random slope term, to account for repeated measurements. Using a simulation study, we tested the robustness of approaches for selecting the outcome, time variable, and statistical model in estimating the effect of an exposure X on longitudinal epigenetic aging. 
Modelling either epigenetic age or EAA as the outcome of an LMM, including both a random intercept and slope term, gave unbiased estimates for the effect of an exposure on epigenetic aging. Chronological age was a more robust time variable than categorical time at measurement. GEE and LMM without a random slope term led to similar accuracy, while the delta aging approach resulted in large bias when estimating the effect of an exposure on epigenetic aging. In summary, these results provide guidance for future studies evaluating associations between exposures and longitudinal epigenetic age. Ultimately, more consistent methods in epigenetic aging studies will improve the robustness and reproducibility of findings.
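The two simplest quantities discussed above can be sketched as follows. The example assumes that EAA is taken as the residual from a least-squares regression of epigenetic age on chronological age (a common convention, though the abstract only describes EAA as a departure from prediction), and it implements delta aging as follow-up minus baseline. All ages are invented.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def age_acceleration(chronological, epigenetic):
    """Assumed EAA: residual of epigenetic age regressed on chronological age."""
    slope, intercept = fit_line(chronological, epigenetic)
    return [e - (intercept + slope * c)
            for c, e in zip(chronological, epigenetic)]


def delta_aging(baseline, follow_up):
    """Per-individual change in epigenetic age between two visits."""
    return [f - b for b, f in zip(baseline, follow_up)]


# Hypothetical cross-sectional sample of four individuals.
chronological = [40, 50, 60, 70]
epigenetic = [42, 49, 62, 69]
eaa = age_acceleration(chronological, epigenetic)

# Hypothetical epigenetic ages of two individuals at baseline and follow-up.
deltas = delta_aging([42.0, 49.0], [47.5, 53.0])
print([round(v, 2) for v in eaa], deltas)
```

Delta aging discards the elapsed time and any within-person correlation structure, which may help explain why a mixed model with a random intercept and slope was the more robust choice in the simulations.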
mRNA and protein abundances correlate only moderately. It is unclear to what extent this moderate correlation reflects post-transcriptional regulation and to what extent it reflects measurement error. Here, by analysing replicate proteomic profiles of tumours and cell lines, we show that there is considerable variation in the reproducibility of proteins measured using mass spectrometry (MS). Proteins with more reproducible MS measurements tend to have a higher mRNA-protein correlation. The reproducibility of individual proteins is somewhat consistent across studies, and we exploit this to develop an aggregate reproducibility score that explains a substantial amount of the variation in mRNA-protein correlations across multiple studies. A complementary approach to MS based proteomics is to measure proteins using Reverse Phase Protein Arrays (RPPA) wherein hundreds of proteins are measured using antibodies. However, we find that the variable quality of antibodies in these arrays has a significant influence on the observed mRNA-protein correlations. Overall, these results suggest that reliability of protein abundance measurements accounts for a substantial fraction of the unexplained variation between mRNA and protein abundances.
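The link between measurement reliability and observed correlation can be illustrated with the classical attenuation relationship from measurement theory, in which the expected observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. This is a textbook formula used here for intuition only; it is not stated to be the method used in this work, and the numbers are hypothetical.

```python
from math import sqrt


def expected_observed_correlation(true_corr, reliability_x, reliability_y):
    """Classical attenuation: r_observed = r_true * sqrt(rel_x * rel_y)."""
    for rel in (reliability_x, reliability_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie in [0, 1]")
    return true_corr * sqrt(reliability_x * reliability_y)


# Hypothetical protein whose true mRNA-protein correlation is 0.8.
well_measured = expected_observed_correlation(0.8, 1.0, 1.0)
poorly_measured = expected_observed_correlation(0.8, 1.0, 0.25)
print(round(well_measured, 2), round(poorly_measured, 2))
```

Under this relationship, a protein measured with low reproducibility would show a weak mRNA-protein correlation even when the underlying biological coupling is strong.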
The ability to measure gene expression levels for individual cells (vs. pools of cells) is crucial to address many important biological questions, such as the study of stem cell differentiation, the detection of rare mutations in cancer, or the discovery of cellular subtypes in the brain. Single-cell transcriptome sequencing (RNA-Seq) allows the high-throughput measurement of gene expression levels for entire genomes at the resolution of single cells. RNA-Seq studies provide a great example of the range of questions one encounters in a Data Science workflow, where the data are complex in a variety of ways, there are multiple analysis steps, and drawing on rigorous statistical principles and methods is essential to derive reliable and interpretable biological results. In this talk, I will provide a survey of statistical questions related to the analysis of single-cell RNA-Seq data to investigate the differentiation of stem cells in the brain, including exploratory data analysis, dimensionality reduction, normalization, expression quantitation, cluster analysis, and the inference of cellular lineages.
Professor Sandrine Dudoit , University of California, Berkeley
Professor Sandrine Dudoit is Associate Dean for the Faculty in the Division of Computing, Data Science, and Society, Professor in the Department of Statistics, and Professor in the Division of Biostatistics, School of Public Health, at the University of California, Berkeley. Professor Dudoit's methodological research interests regard high-dimensional statistical learning and include exploratory data analysis (EDA), visualization, loss-based estimation with cross-validation (e.g., density estimation, classification, regression, model selection), and multiple hypothesis testing.
Much of her methodological work is motivated by statistical questions arising in biological research and, in particular, the design and analysis of high-throughput sequencing studies, e.g., single-cell transcriptome sequencing (RNA-Seq) for discovering novel cell types and for the study of stem cell differentiation. Her contributions include: exploratory data analysis, normalization and expression quantitation, differential expression analysis, class discovery and prediction, inference of cell lineages, and the integration of biological annotation metadata (e.g., Gene Ontology (GO) annotation). She is also interested in statistical computing and, in particular, computationally reproducible research. She is a founding core developer of the Bioconductor Project (http://www.bioconductor.org), an open-source and open-development software project for the analysis of biomedical and genomic data.