  Age.of.Diagnosis Native.American European  African
                60             0.5      0.4      0.1

I have the ancestry data on ~100 individuals with cancer, and I would like to know how ancestry affects the age of diagnosis. Essentially I have 2 continuous predictors (ancestry) and 1 continuous response variable (age of diagnosis). What statistical method might be appropriate here?

Adrian
  • It would appear you actually have only two predictors: isn't each one always going to be equal to $1$ minus the sum of the other two? – whuber Sep 22 '15 at 18:18
  • @whuber Oh yes, you are exactly right. I do have two predictors. Thanks for pointing that out. – Adrian Sep 22 '15 at 18:54
  • What exactly does a record in your data set represent? Are 47.8% of all persons who were diagnosed with cancer at 55y Native American? Or is it a single person's exact ancestry? – Michael M Sep 22 '15 at 18:55
  • Perhaps survival analysis, if you know when they were monitored, e.g. all subjects monitored from 50 to 60 years of age, so that you know subject 1 didn't test positive for cancer between 50 and 54 years, as opposed to only having gone to the doctor at 55. – seanv507 Sep 22 '15 at 19:15
  • @seanv507 I considered doing survival analysis, but no I don't know when they were monitored. In this case would doing survival analysis be inappropriate? – Adrian Sep 22 '15 at 19:20
  • If you don't know the monitoring period then survival analysis is inappropriate – seanv507 Sep 22 '15 at 22:38
  • Be cautious if any of the groups are immigrants, as the age distribution among immigrant groups tends to be very different from that of other populations within a country. This may seriously bias the conclusions. Age-adjusted incidence rates could be considered. – Adam Robinsson Sep 25 '15 at 09:35
  • I would go first with regression trees, because their results are easily interpretable (if the tree is not too deep...) – agenis Sep 25 '15 at 09:54
  • I agree with @seanv507; cancer detection is a time-to-event problem. Pathophysiologically, @Adrian, do you have in mind an accelerated failure time model, with ancestry as one determinant of the frailty? Spending some time considering the details of such models might scare you off concluding anything from these data without more information--which could be a good thing. Are there meaningful competing risks within the age range you're considering? Is this an indolent cancer (e.g., prostate) where intensity of surveillance (as opposed to the underlying disease process) could drive detection? – David C. Norris Sep 26 '15 at 15:59
  • My discouraging comments above notwithstanding, it is nice to see ancestry treated in a continuous, multidimensional manner in your data, @Adrian, instead of the usual categorical way. – David C. Norris Sep 26 '15 at 16:01
  • @Adrian See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4069235/ – Saket Choudhary Sep 27 '15 at 19:33
  • @Adrian: You deleted most of your question in an edit. I suppose it was an accident, as you've already two answers & several comments referring to the deleted contents; so I've rolled it back. – Scortchi - Reinstate Monica Oct 05 '15 at 22:20
  • @Adrian: What's going on? The answers will become confusing to readers if you change the question like this. – Scortchi - Reinstate Monica Oct 05 '15 at 23:03

2 Answers


There are examples of using multiple linear regression for similar studies [1].

Here is a notebook example of doing this in R.

[1] Genomic ancestry and somatic alterations correlate with age at diagnosis in Hispanic children with B-cell ALL
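
As a minimal sketch of the multiple-regression idea (not the analysis in [1] or in the linked notebook), the fit could look like the following in Python. The data are simulated purely for illustration, and the function name `fit_age_on_ancestry`, the Dirichlet parameters and the coefficients are all assumptions. Because the three ancestry fractions sum to 1, only two of them enter the model alongside the intercept; the African fraction is treated as the reference here.

```python
import numpy as np

def fit_age_on_ancestry(age, native_american, european):
    """Least-squares fit of age ~ intercept + native_american + european.

    The African fraction is left out: the three fractions sum to 1, so
    including all three together with an intercept would be collinear.
    """
    age = np.asarray(age, dtype=float)
    X = np.column_stack([np.ones_like(age), native_american, european])
    coef, _, rank, _ = np.linalg.lstsq(X, age, rcond=None)
    return coef, rank

# Simulated (NOT real) data: 100 individuals, hypothetical effect sizes.
rng = np.random.default_rng(0)
n = 100
fractions = rng.dirichlet([2.0, 2.0, 1.0], size=n)  # columns: NA, EU, AF
age = (50.0 + 10.0 * fractions[:, 0] - 5.0 * fractions[:, 1]
       + rng.normal(0.0, 1.0, size=n))

coef, rank = fit_age_on_ancestry(age, fractions[:, 0], fractions[:, 1])
print("intercept, Native American slope, European slope:", coef)

# Adding all three fractions plus an intercept gives a rank-deficient design.
full = np.column_stack([np.ones(n), fractions])
print("columns:", full.shape[1], "rank:", np.linalg.matrix_rank(full))
```

Each slope is then read as the change in expected age at diagnosis when that ancestry fraction increases at the expense of the omitted (African) fraction. Whether such a fit supports any conclusion still depends on the study design.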

  • Good ol' linear regression is a tempting choice, but we need to know the study design in order to conclude anything about these data. – Andrew M Sep 28 '15 at 23:48

Here is a somewhat simplistic (Bayesian) model for your problem. I did not complete it (I am not 100% positive I can). Let me know if this seems reasonable, it looks like a nice problem!

Imagine a human being is made of three independent parts: a Native American part (1), a European part (2) and an African part (3). Thus, a person is represented as $x \in \mathbb{R}_{+}^3$ (meaning $x_1, x_2, x_3 \geq 0$) with the constraint $x_1+x_2+x_3 = 1$.

In order to get cancer, a human must get "enough" cancer in their individual parts. So if a person is $x = (\frac{1}{3},\frac{1}{3},\frac{1}{3})^t$, then they can get cancer if each part gets cancer independently. Alternatively, they can get cancer by summing: say the European part gets cancer twice, the African part gets cancer once and the Native American part doesn't get cancer at all. In this context it is assumed that getting cancer and being diagnosed are the same (I know it might not hold, but we can make things complicated later).

The probability for a part to get some number of cancers in time $T$ is modeled as a Poisson process (very reasonable: it counts how many hits one gets in a given time) with some corresponding parameter $\lambda_i$. Consequently, for an individual $x$, the occurrence of cancers until time $T$ is distributed like $\sum_{i=1}^3 x_i N_i(T)$, where each $N_i(t)$ is a Poisson process with parameter $\lambda_i$. We are only interested in the first time the individual gets a total of 1 cancer or more. I believe this time should be an exponential random variable with a parameter that is a kind of average of the $\lambda_i$'s (I don't know that for sure, this is the missing part).

\begin{eqnarray}
\begin{split}
P( \text{Cancer before time } t|\lambda, x) &= P\Big( \sum_i x_i N_i(t) \geq 1 \Big) \\
&= \sum_{k_1,k_2,k_3} \prod_{i=1}^3 \frac{e^{-\lambda_i t} (\lambda_i t)^{k_i}}{k_i!} \\
&= \sum_{k_1,k_2,k_3} \frac{e^{-(\lambda_1+\lambda_2+\lambda_3)t} \lambda_1^{k_1}\lambda_2^{k_2}\lambda_3^{k_3}t^{k_1+k_2+k_3}}{k_1!k_2!k_3!}
\end{split}
\end{eqnarray}

where the sum is over triplets $(k_1,k_2,k_3)$ such that $\sum_i k_i x_i \geq 1$. This seems to be the hard part. However, it looks like one can estimate it numerically, so that might just be enough.
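
One way to do the numerical evaluation is sketched below, taking each $N_i(t)$ to be Poisson with mean $\lambda_i t$ and the parts to be independent. It works with the complement: the triplets with $\sum_i k_i x_i < 1$ form a finite set, since any part with $x_i > 0$ needs $k_i < 1/x_i$, and parts with $x_i = 0$ never contribute and can be summed out. The function names and the rate values are assumptions made for illustration only.

```python
import itertools
import math

def poisson_pmf(k, mu):
    """P(N = k) for N ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def prob_cancer_before(t, lam, x):
    """P(sum_i x_i * N_i(t) >= 1) with independent N_i(t) ~ Poisson(lam_i * t).

    Computed as 1 - P(sum_i x_i * N_i(t) < 1), which is a finite sum.
    """
    # Parts with x_i == 0 cannot add to the total, so they are marginalised out.
    active = [(xi, li * t) for xi, li in zip(x, lam) if xi > 0]
    # For an active part, k_i * x_i < 1 forces k_i < 1 / x_i.
    ranges = [range(math.ceil(1.0 / xi)) for xi, _ in active]
    p_below = 0.0
    for ks in itertools.product(*ranges):
        total = sum(k * xi for k, (xi, _) in zip(ks, active))
        if total < 1.0:  # floating-point comparison; ties at exactly 1 are fragile
            p_below += math.prod(
                poisson_pmf(k, mu) for k, (_, mu) in zip(ks, active)
            )
    return 1.0 - p_below

# Hypothetical rates per year for the three parts (assumed values).
lam = (0.02, 0.03, 0.01)
x = (0.5, 0.4, 0.1)  # the ancestry fractions from the example row
print(prob_cancer_before(60.0, lam, x))
```

For a person with a single ancestry, say $x = (1, 0, 0)$, the function reduces to $1 - e^{-\lambda_1 t}$, so the time to the first cancer is exponential in that special case.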

If one has this distribution $P( \text{Cancer before time } t|\lambda, x)$, then calculating the data likelihood $p(\text{Data} | \lambda )$ should be straightforward (a product over independent observations, each with the above distribution). Choose a reasonable prior for $\lambda$ and Bayes' rule will allow you to get the posterior $p(\lambda | \text{Data})$. For a new individual $x$, the distribution of the age at which they get cancer is going to be the predictive distribution $$ p( \text{Cancer} | x, \text{Data} ) = \int p( \text{Cancer} | \lambda, x) p(\lambda | \text{Data} ) d\lambda, $$ which can (probably) be estimated using Gibbs sampling.

Yair Daon