Back to Navigation Page

NonParametric Hypothesis Testing and Estimation




NonParametricTest

The syntax of this proc is
q = NonParametricTest(data,dep,indep,test);
The input data here is a NxP matrix, dep are the dependent variables whereas indep is, when appropriate, a group variable. The final entry, test is a string defining which test to do. The output q is a packed vector containing a variety of information. Here are a few, related, examples:
  • q = NonParametricTest(data,5,0,"Wilcoxon");
    produces a signed Wilcoxon test on the fith column of data. This is because there is no group variable.
  • q = NonParametricTest(data,5|6,0,"Wilcoxon");
    also produces a signed Wilcoxon test, on the difference of the fith and sixth columns of data.
  • If the third column of data defines which of two groups the record belong to, the code
    q = NonParametricTest(data,5,3,"Wilcoxon");
    produces the classical Wilcoxon test on the fith column of data.
  • If the third column of data defines which of more than two groups the record belong to, the code
    q = NonParametricTest(data,5,3,"Wilcoxon");
    produces the Kruskall-Wallis test on the fith column of data.
In all these cases the test is based on the appropriate linear rank. We can do the same test on normal scores instead by inserting the option
AddOption("Normal Scores");
prior to the call. This test is usually called the normal score test, or van der Waerden test. We can alternatively call it by either of
q = NonParametricTest(data,dep,indep,"Normal scores test");
q = NonParametricTest(data,dep,indep,"Van der Waerden's test");
Another alternative is to use so-called Savage scores, which is done similarily by
AddOption("Savage scores");
or, since this test usually is called the logrank test, by
q = NonParametricTest(data,dep,indep,"logrank test");
There are however even more bells and whistles for this test. Normal scores can be computed by three different methods: Bloms, Tukeys, van der Waerdens which are called by the respective keywords Blom, Tukey, vW. Bloms method is the default. Tied ranks are by default given the mean rank. We can override this by assigning the low/high rank instead by giving the appropriate word among the options. We can also rank from highest to lowest instead of the converse, by giving the option descend. Other variants of the Wilcoxon test can be called by their names: Mood, Klotz and Conover. A many-sample test that is not of Wilcoxon type is the Jonckhere-Terpstas trend test which can be used for dose-response studies.

All many-sampled tests can be analysed stratified on other variables. These are then specified by the global _rmn_strata, which is a Lx2-vector specifying L different variables defining strata. The first column should contain column number (negative for character data), the second a name for the strata. A stratified test can be done weighted according to strata size or equally weighted. The default is that strata are weighted according to size. To override this, add the option NoWeights.

For dose-response studies within subjects we have Page's trend test, which is called by a command like

q = NonParametricTest(data,dep,0,"Page's test");
where dep is a sequence of variables ordered by, say, increasing or decreasing doses. A much weeker test, for unordered variables, is Friedmann's test. A stronger(?) test for this situation is the aligned rank test. Another trend test is Theils test, which is essentially a non-parametric linear regression:
q = NonParametricTest(data,dep,indep,"Theil's test");
where indep is the independent variable and dep the dependent one. The output (both printed and q) contains not only the test P-value, but also a non-parametric estimate of the slope.

One final note. By default P-values are computed by large-sample approximations. You can override this by

AddOption("Exact");
However, the underlying method of obtaining exact P-values is combinatorial, and actually goes through all available combinations. So this should only be done on small samples, unless your computer is really powerful.

Back To Top


HodgeLehmannEstimate and WilcoxonProb

Often we might append a nonparametric test with some quantative measure of effect. In Riemann we have two such available as of today:
  • The Hodge-Lehmann Estimate of location, and
  • an estimate of the probability P(X < Y).
both with confidence limits. The syntax for the former is
out = HodgeLehmannEstimate(data,dep,indep);
with the same input as for NonParametricTest. The output contains the estimate and its confidence limits. The syntax for the latter is
out = WilcoxonProb(data,dep,indep);
with similar input/output. In both cases pairwise comparisons are made between all groups defined in indep. Hodge-Lehmann estimate can also be obtained when this is 0, in which case an estimate of a symmetri point of the distribution is made.

Back To Top


SurvivalTest

The syntax of this proc is
{z,cov,table} = SURVIVALTEST(dataset,dep,indep,rho);
Here dataset can be either a data matrix in memory or a dataset on disk. Variables are defined either by column numbers or variable names. The independent variable should define a group belonging (negative number for character data). The dependent variables in dep should be either 1,2 or 3:
  • If dep has only one entry, this variable should contain survival time and the assumption is that there are no censoring at all
  • If dep has precisely two entries, the first should contain survival time and the second should be a censor variable.
  • If dep has three entries, the first should contain start of observation time, the second end of observation time and the third should be the outcome/censor variable.
The final input, rho, defines what type of test is done: If this is a numeric number this defines a member of the Fleming-Harringtons family of tests. The case rho = 0 corresponds to the logrank test, whereas the case rho = 1 gives an approximation of the Peto-Peto modification of the Wilcoxon test. If we take rho = "Wilcoxon" the Gehan-Wilcoxon test is obtained.

We obtain stratified tests by using the global variable _rmn_strata to define strata.

The output contains information from which other tests can be obtained:

  • z is the deviance (observed-expected)
  • cov is the covariance matrix of the deviance
  • table outputs the table that is printed by the test if output is on.
The following examples illustrates the usage. In the appropriate setting the code
{z,cov,table} = SurvivalTest(leukemia,2|3,-4,0);
produces the output
group      N      Obs.   Exp.   (O-E)^2/E
-------------------------------------------
M          11       7   10.69      1.273
NonM       12      11    7.31      1.862

Chisq = 3.40 on 1 degrees of freedom,  P = 0.06534
for a logrank test. If we sant to construct a test for a particular linear combination a of the deviances, we can form the appropriate test statistics from z*a and a'z*a.

Back To Top


SurvivalFunction and SurvivalPercentiles

The command
{out, means} = SurvivalFunction(data,dep,keys);
estimates the survival function from data defined by the columns dep in a data matrix data. Like for SurvivalTest we can use 1,2 or 3 rows for dep. If keys is nonzero, individual survival functions are estimated for each key combination.

The output out is a large matrix which starts with key-combination, follow up with jump-time with some basic information. Then follows the Nelson-Aalen estimator of the cumulative hazard function, with standard error and confidence limits. Finally we have the survival function estimate with standard error and confidence limits. In all we have n+11 column, where n is the number of key variables (n = 0 if no key). In a similar way, means contains keys followed by estimated mean and standard deviation of the survival distribution.

There are a number of options for SurvivalFunction, which are discussed below.

Here is a simple example:

{out,means} = SurvivalFunction(leukemia,2|3,-4);
in which case column 4 of leukemia is a character vector. Part of the output is as follows:
group  time  at risk events   S_KM    se    95% conf limits
M        0      11     0     1.000       .         .         .
M        9      11     1     0.909  0.2944    0.5390    0.9867
M       13      10     1     0.818  0.3410    0.4729    0.9512
M       18       8     1     0.716  0.3737    0.3645    0.8990
M       23       7     1     0.614  0.3907    0.2854    0.8353
M       31       5     1     0.491  0.4052    0.1802    0.7534
M       34       4     1     0.368  0.4033    0.1132    0.6570
M       48       2     1     0.184  0.3918    0.0288    0.5250
NonM     0      12     0     1.000       .         .         .
NonM     5      12     2     0.833  0.3280    0.5235    0.9555
NonM     8      10     2     0.667  0.3689    0.3753    0.8597
NonM    12       8     1     0.583  0.3773    0.2906    0.8009
NonM    23       6     1     0.486  0.3849    0.2024    0.7297
NonM    27       5     1     0.389  0.3834    0.1421    0.6498
NonM    30       4     1     0.292  0.3724    0.0901    0.5609
NonM    33       3     1     0.194  0.3491    0.0476    0.4614
NonM    43       2     1     0.097  0.3031    0.0166    0.3489
NonM    45       1     1     0.000       .         .         .
except that the heading is not included in the output vector.

Two natural things to do with a survival function is to plot it and to compute percentiles from it. The latter is done by the command

perc = SurvivalPercentiles(out,keys,p);
where out is the output from SurvivalFunction. The input keys give the key variables in out (these come first, negative numbers for character data. Finally, p is a vector of numbers between 0 and 1 defining percentiles.

Continuing on the previous example with the code

perc = SurvivalPercentiles(out,-1,0);
means = merge(means,-1,q[.,1 3:5],-1,0);
fmt = ("-*.*s"~8~8)|reshape("*.*lf"~10~2,2,3)|
          reshape("*.*lf"~8~0,3,3);
"Group         mean      sem     median   95% conf.limits";
"------------------------------------------------------------";
call printfm(means,0~ones(1,5),fmt);
we get the table
Group         mean      sem     median   95% conf.limits
------------------------------------------------------------
M            52.65     19.83      31      13       .
NonM         22.71      4.18      23       8      33
To plot is facilitated by the proc PlotCurves, as illustrade in the following piece of code which takes the output from above:
xy(PlotCurves(out,-1,2|9));
(column 2 is time in out and column 9 is the survival function estimate).

As already mentioned there are a number of options available for how to estimate the survival function and its confidence limits.

  • The default is that the Kaplan-Meier estimator is computed. You get the Fleming-Harrington estimator instead by adding the option Fleming-Harrington.
  • To tie-correct the Nelson-Aalen estimator, use the option Efron.
  • The confidence intervals are by default estimated on the log-scale (this is how e.g. S-Plus does it). If you want them estimated on the original scale, use the option ciplain; if you want it estimated on the log-log-scale (as in the book by Anderson-Gill etc) use the option ciloglog.
  • To get the lower confidence limit estimated by the method of Peto, use the option Peto.

Back To Top


CompeteRiskP

If we call SurvivalFunction in a competing risks situation, i.e. when there are more than one type of event that can happen, we get the survival distribution for one type, when all the other types are considered censored. To compute the risk of a specific event, in an enviroment where other event types are working we call
out = CompeteRiskP(data,dep,keys);
which has the same input data as SurvivalFunction.

Back To Top


Comments and suggestions, please mail: Anders Källén
Last modified: 98-09-12 9:00