Joan Ginther's fourth jackpot-winning lottery ticket (she has a PhD in statistics!).
Aim and content of the course
In this lecture series we study controversial applications of
statistics in a number of areas: in the courtroom, in medical
research and in public debate, for example on climate change,
crime and ethnicity. The aim is to alert participants to common
fallacies in reasoning about chance and in drawing conclusions
from a limited number of observations. The lecturers intend that,
by the end of the series, participants will be able to assess the
results of statistical research in the media and in society with
an appropriately critical attitude.
The following topics will be covered: correlation versus
causation; spurious correlation; Simpson's paradox; publication bias
and the decline effect; selection bias; the interpretation of
conditional probabilities; and the prosecutor's fallacy and the
defence attorney's fallacy in court cases.
We will begin with an introductory lecture to establish the
conceptual framework, in which we will also discuss and analyse
simple classical paradoxes from probability theory, such as the
three-doors (Monty Hall) problem, the brother-and-sister problem
and the two-envelopes problem.
After that, in each session an expert will speak about a topical
case, accompanied by a talk on its probabilistic and statistical
aspects given by one of the organisers.
No background in mathematics or the natural sciences is required.
The lectures will focus on ideas and ways of thinking.
8 Mondays, 10 Oct. through 12 Dec. (no lecture on 14 November and 5 December)
15:15-17:00
Location: Pieter de la Courtgebouw, Wassenaarseweg 52, room 1A41 (1st floor, A wing, room 41).
Format: lectures with the opportunity to ask questions.
Self-study: approx. 3 hours per week.
Requirements: reading proficiency in English.
Course material:
• Texts on the internet and on this site, which has been set up specially for the course.
Lecturers:
Richard D. Gill (professor of mathematical statistics, Univ. Leiden) (hereafter: RDG)
Peter D. Grunwald (professor by special appointment of statistical learning, Univ. Leiden; and CWI, Amsterdam) (PDG)
Willem R. van Zwet (emeritus professor of mathematical statistics, Univ. Leiden) (WRvZ)
Provisional programme:
10 October | Introduction | RDG & PDG | | |
17 October | Doping controls in sport | Klaas Faber | RDG | The Claudia Pechstein case |
24 October | Statistically nonsensical neuroscience research | Erik-Jan Wagenmakers | PDG | |
31 October | Climate change | Marcel Crok | PDG | The "hockey stick" phenomenon |
7 November | The Lucia de B. case | Metta de Noo | WRvZ | Statistics goes off the rails in the courtroom |
14 November | NO LECTURE | | | |
21 November | "Bias" in epidemiological research | Luc Bonneux | RDG | Breast cancer screening |
28 November | Evidence-Based Medicine: what can go wrong | Richard Gill | RDG | The Utrecht probiotics trial |
5 December | NO LECTURE | | | |
12 December | Forensic DNA | Peter de Knijff | RDG | Case: a problematic mixed trace; combining different kinds of DNA evidence. |
Announcements:
Remarks on the Probiotica case.
RDG will demonstrate some simulations and other calculations for randomized clinical trials using the R language for statistical computing. Intrepid participants may want to download and install R in order to try out some R scripts themselves. The main site is www.R-project.org. R is a computer environment for statistical modelling and data analysis. Nowadays it is used all over the world, and users from genetics, finance, ... (you name it, they have done it) have added specialist packages for their own kinds of statistical problems. It is free software, where the word "free" is to be understood both in the sense of "free beer" and in the sense of "free speech". This has certainly contributed to the enormous user base, the active community of users, and the reliability of the software.
Remarks on the Pechstein case.
RDG plans to add to the site an "executive summary" of Faber's results. In particular, it will address the following excellent question from the audience: "How on earth can they (think they) know that a certain threshold has only a probability of 1 in a thousand of being exceeded?" You would need hundreds of thousands of independent measurements to be reasonably certain about this, if you want direct empirical evidence of where the threshold should be. They don't have that. Instead, an enormous extrapolation is being made. It is made essentially by assuming a normal distribution, fitting the centre and spread of the distribution from a much smaller number of measurements, and then extrapolating. Unfortunately, in the real world, though many distributions look roughly normal in the middle, they are often far from normal in the tails. (To be continued)
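The extrapolation argument can be made concrete with a small sketch. Everything in it is an assumption chosen for illustration, not data from the Pechstein case: the "true" measurements are taken to follow a Laplace distribution (bell-shaped in the middle, heavier-tailed than a normal), the sample size of 300 is arbitrary, and the names `draw_laplace` and `laplace_upper_tail` are made up for this example.

```python
import math
import random
import statistics

random.seed(1)

# ASSUMPTION: the hypothetical "true" distribution is Laplace with mean 0 and
# standard deviation 1. It looks roughly normal in the middle, not in the tails.
SCALE = 1 / math.sqrt(2)  # Laplace scale b; the variance is 2 * b**2 = 1


def draw_laplace():
    # The difference of two independent exponentials with mean b is Laplace(0, b).
    return random.expovariate(1 / SCALE) - random.expovariate(1 / SCALE)


def laplace_upper_tail(x):
    # Exact P(X > x) for Laplace(0, b); this formula is valid for x >= 0.
    return 0.5 * math.exp(-x / SCALE)


# A modest number of measurements, far fewer than hundreds of thousands.
sample = [draw_laplace() for _ in range(300)]

# Fit centre and spread, then extrapolate into the tail with a normal model.
fitted = statistics.NormalDist(statistics.fmean(sample), statistics.stdev(sample))
nominal = 0.001  # the claimed "1 in a thousand"
threshold = fitted.inv_cdf(1 - nominal)

# What the exceedance probability really is under the assumed true distribution.
actual = laplace_upper_tail(threshold)

print(f"threshold from normal fit:      {threshold:.2f}")
print(f"claimed exceedance probability: {nominal}")
print(f"actual exceedance probability:  {actual:.4f}")
```

Under these assumptions the actual exceedance probability typically comes out several times larger than the claimed 1 in a thousand, even though a histogram of the 300 measurements would look reasonably bell-shaped.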
But what *is* chance, then?
Several times we were asked "what does it actually mean when we say that something is chance?". RDG thinks there are many answers, depending on the context. What is chance? Courts, scientists and ordinary people in everyday life continually see things which seem to have meaning, which seem to demand an explanation. But perhaps it is just chance. For instance, was it just chance that Joan Ginther won the jackpot in the lottery four times over the space of about 15 years? Could be, of course... But suppose you happened to learn that she had a PhD in statistics from Stanford University, lived in Las Vegas, was well off, did not seem to have any job, but just occasionally travelled to a small town with a huge amount of money strapped to her waist and bought all the tickets from the town store? (Continued, here; warning: amateur philosophy, not science!).
Genetic counselling.
I showed you just a little bit of a story about Mrs X, one of a family of 10 brothers and sisters, whose mother, aunt (mother's sister) and grandmother (on their mother's side) had all been diagnosed with cancer at relatively young ages (before age 60). A clinic at a university hospital did genetic testing on the family but by the time four brothers and sisters had been tested, only negative results had been obtained. The clinic said that because of a 1 in 16 conditional chance that four brothers and sisters would all not have a particular gene, given that their mother did, there was no point in testing more of the family. It was much more likely, it was claimed, that another genetic factor was responsible for the clustering of cancer cases in the family.
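A short sketch may help separate the two conditional probabilities involved here. The 1 in 16 is the chance of four negative tests given that the mother carries the gene; the chance that the mother carries the gene given four negative tests also depends on how likely that was beforehand. The prior values below are purely hypothetical, and the sketch assumes an error-free test, a non-carrier father, and that Mrs X herself is not one of the four tested siblings.

```python
from fractions import Fraction


def posterior_mother_carrier(prior, n_negative_children):
    # Each child of a carrier mother inherits the mutation with probability 1/2,
    # so n children all test negative with probability (1/2)**n.
    like_carrier = Fraction(1, 2) ** n_negative_children
    # If the mother is not a carrier, all children test negative for certain
    # (ASSUMPTION: the father is not a carrier and the test makes no errors).
    like_noncarrier = Fraction(1)
    numerator = prior * like_carrier
    return numerator / (numerator + (1 - prior) * like_noncarrier)


# The clinic's figure: P(four negative tests | mother is a carrier).
likelihood = Fraction(1, 2) ** 4
print("P(4 negatives | mother carrier) =", likelihood)

# HYPOTHETICAL priors for the mother being a carrier, given the family history.
for prior in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    post = posterior_mother_carrier(prior, 4)
    # An untested child of the mother carries the gene with half that probability.
    print(f"prior {prior}: mother {post}, untested child {post / 2}")
```

With a prior of 1/2 the posterior for the mother is 1/17, but with a prior of 9/10 it is still 9/25, so whether further testing is pointless depends heavily on the prior and not on the 1 in 16 alone.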
I built a so-called Bayes' net for Mrs X, describing the probability rules for inheritance of genes in such a family, and describing how the chance of cancer depends on the genes each person carries. The most realistic model I made assumed three "bad" genes (possible harmful mutations of three different "good" genes). Two of these represent the now well known BRCA1 and BRCA2 genes (or gene mutations). The harmful mutation of each of these two genes is very rare (5 in 10 000 people carry them), but if you are unfortunate enough to have either, your risk of getting early cancer is very large indeed (about 50%). Many other (mutations of) (other) genes exist, for sure, but have not been identified yet, precisely because each separately is a good deal less harmful. We know this because the dependence within families is much larger than can be accounted for either by BRCA1 or BRCA2, and much larger than can be accounted for by environmental and life-style factors. So my third gene is somewhat imaginary: but it stands for something very real: a lot of genetic and shared environmental factors together. I gave this pseudo-gene a population frequency, and an effect of increasing the risk of cancer, tuned so as to reproduce the actually observed correlations in families.
You can download text files describing a one-gene model (BRCA1.xdsl), a two-gene model (BRCA2.xdsl), and a three-gene model: a basic version (BRCA3basic.xdsl), and one with some extra stuff added (BRCA3.xdsl). These files can be read by the program GeNIe. In GeNIe you can fill in observed values of phenotypes and watch how the probability distribution of the genotypes of the persons in the family changes when we take account of these observations. Next we can fill in the observed genotypes (with regard to BRCA1 and BRCA2) of four siblings, and watch again how this information changes the probabilities that Mrs X carries either mutation, or another bad gene.
Computer enthusiasts might enjoy designing Bayes' nets themselves and running the calculations on their own PC (or Mac, or Linux computer) at home.
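For readers who want to see the mechanics without GeNIe, here is a deliberately tiny one-gene toy network, evaluated by brute-force enumeration. It is not a reconstruction of the .xdsl models: it has only a mother, Mrs X and four tested siblings, it leaves out the aunt, the grandmother and the third pseudo-gene, and the baseline risk for non-carriers (`RISK_BASELINE`) is an assumed value that does not appear in the text. The carrier frequency and the 50% risk are the figures quoted above.

```python
from itertools import product

P_CARRIER = 5 / 10_000   # population frequency of the mutation (from the text)
RISK_CARRIER = 0.5       # early-cancer risk for carriers (from the text)
RISK_BASELINE = 0.03     # ASSUMED early-cancer risk for non-carriers
N_TESTED_SIBLINGS = 4

NAMES = ["mother", "mrs_x", "mother_cancer", "siblings_negative"]


def joint(mother, mrs_x, mother_cancer, siblings_negative):
    """Joint probability of one complete assignment of the toy network."""
    p = P_CARRIER if mother else 1 - P_CARRIER
    # A child inherits the mutation with probability 1/2 from a carrier mother
    # (ASSUMPTION: the father is not a carrier).
    p_inherit = 0.5 if mother else 0.0
    p *= p_inherit if mrs_x else 1 - p_inherit
    risk = RISK_CARRIER if mother else RISK_BASELINE
    p *= risk if mother_cancer else 1 - risk
    p_all_negative = (1 - p_inherit) ** N_TESTED_SIBLINGS
    p *= p_all_negative if siblings_negative else 1 - p_all_negative
    return p


def posterior(query, **evidence):
    """P(query is True | evidence), by summing the joint over all assignments."""
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this assignment contradicts what was observed
        p = joint(**world)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator


print("no evidence:          ", posterior("mother"))
print("mother had cancer:    ", posterior("mother", mother_cancer=True))
print("plus 4 negative tests:",
      posterior("mother", mother_cancer=True, siblings_negative=True))
print("Mrs X, all evidence:  ",
      posterior("mrs_x", mother_cancer=True, siblings_negative=True))
```

Filling in the mother's diagnosis raises the probability that she is a carrier, and adding the four negative sibling tests lowers it again, which is the same kind of updating GeNIe performs on the full family models. The actual numbers from this toy should not be read as results for Mrs X's family.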