Back to Navigation Page

Basic Descriptive and Inferential Statistics in Riemann




Descriptive and elementary inferential statistics is done by a number of procs in Riemann that are described here, together with short examples of their usage.

A few notations are common to all procs: dataset is either a data matrix in memory or a Gauss dataset on disk, vars and cvars are columns/variables specified in dataset. Specification can be done either by column number of by variable names.

Descriptive statistics:

  • ameanc, gmeanc, medianc computes arithmetic, geometric and medians respectively, with pairwise deletion (i.e. missings are ignored columnwise)
  • Frequency computes the number of occurrances of different values of a variable, possibly within subgroups defined by control variables.
  • DescStat computes N, mean, standard deviation, minimum and maximum of variables, possibly within subgroups defined by control variables.
  • Quantiles computes quantiles, including medians, as specified. As default it uses the Empirical Cumulative Distribution Function, but linear interpolation can be used if specified as an option.
  • SummaryStatistics is essentially a print-out proc calling DescStat and Quantiles.

Elementary statistics of the mean:

  • MeanCI computes confidence intervals and (optionally) P-values (for the hypothesis of mean equal 0) for means, possibly within subgroups defined by control variables
  • MeanDifferenceCI computes confidence limits and (optional) P-values for mean difference, possibly within subgroups defined by control variables. Thus this proc does classical Student's t-test.
  • FiellerInterval computes the Fieller confidence limits for the ratio of two mean values.

Other basic statistical methods:

  • Correlation computes the correlation between two variables together with confidence interval and P-value for the hypothesis of no correlation.
  • BinomialProportions computes confidence limits for two binomial proportions in a variety of ways. It uses 'exact' likelihood methods.
  • MantelHaenzelOR computes confidence limits for the odds ratio from a sequence of 2-by-2-tables. Can be done both approximately and exact.


ameanc, gmeanc, medianc

Let x be a nxp-matrix. As such it represents p different variables with up to n observations per variable (missings are allowed). The command
means = ameanc(x);
produces the p column arithmetic means, ignoring missings. Similarily for gmeanc (geometric mean instead) and medianc (median).

Back To Top


Frequency

The syntax of this proc is
{group,freqs,nms} = Frequency(dataset,vars,cvars);
and computes the number of occurrances of values in the columns of dataset specified by vars. It can be done by subgroups if these are specified by cvars. The dataset can be both a data matrix in memory, in which case variables must be specified by column numbers (negative for character data), or a Gauss dataset on disk, in which variable names can be used.

The output is as follows:

  • group is a GxT-matrix, where G is the number of groups defined by cvars. It contains the combinations of values of cvars that define these subgroups.
  • freqs is a GxN-matrix, N is the number of categories in all variables. Which column contains what variable/variable value is given by
  • nms which is a string array containing variable name and its value.
Rows can be weighted using the global __weight. This can be either a column number/name or a vector defining individual row weights.

Back To Top


DescStat

The syntax of this proc is
{group,n,mean,std,min,max} = DescStat(dataset,vars,cvars);
The names of the left hand side defines what parameter is computed. Each of them is GxP, where G is the number of subgroups defined by cvars and P the number of variables. dataset can be either a data matrix in memory or a Gauss dataset on disk. Variable specification is either by column numbers or variable names. The output variable group contains a row for each subgroup and each row is the value combination from the variables in cvars defining a subgroup. Negative numbers should be used to specify a column of a data matrix containing character data.

Rows can be weighted using the global __weight. This can be either a column number/name or a vector defining individual row weights.

The option LOG before the call means that statistics will be computed for logged data.

Back To Top


Quantiles

This proc is similar to built-in ones. It's syntax is
y = Quantiles(dataset,vars,cvars,e);
which computes quantiles of a dataset specified by vars. What quantiles to compute is specified by the column vector e, which should contain numbers between 0 and 1 (e.g., 0.5 is the median) Computations can be done by subgroups if these are specified by cvars. The dataset can be both a data matrix in memory, in which case variables must be specified by column numbers (negative for character data), or a Gauss dataset on disk, in which variable names can be used.

Rows can be weighted using the global __weight. This can be either a column number/name or a vector defining individual row weights.

This proc uses the empirical cumulative distribution function, a step function, when computing the quantiles (so that the median is obtained in the standard way). However, the option Interpolated before the call gives quantiles obtained by the linear interpolation.

Back To Top


SummaryStatistics

is the proc to use if you want to print a summary of statistics. It calls on DescStat and Quantiles. It's syntax is
out = SummaryStatistics(dataset,vars,cvars,e);
The input is the same as for Quantiles, and the output matrix contains all requested statistics. However the main usage of this proc is to get printed summaries, which is done by having the global __output on. Weighing can be done by the global __weight. If the option LOG is in effect, geometric means and coefficient of variations (as percentage) will be computed instead of arithmetic mean and standard deviation.

Back To Top


MeanCI

computes confidence limits and (optionally) P-values for the mean, possibly within subgroups (defined by the variables in cvars). Its syntax is
out = MeanCI(dataset,vars,cvars);
A printed result is obtained by having __output on. The output is a matrix, which might need some work to identify its contents (this is left to the user). If the option LOG is in effect, confidence limits are given for the geometric mean. If the option P-value is on, P-values are also computed. The confidence degree is determined by the global _rmn_Conflevel, for which the default is .95. Finally, weighing can be done by the global __weight.

Back To Top


MeanDifferenceCI

computes confidence intervals and (optionally) P-values for the difference of the mean for two groups. Analysis can be done in subgroups. Its syntax is
out = MeanDifferenceCI(dataset,vars,group,cvars);
Here the input group defines the group belonging, wheras possible subgroups within which the comparisons should be done is defined by cvars.The output is a matrix, which might need some work to identify its contents (this is left to the user). If the option LOG is in effect, confidence limits are given for the geometric mean. If the option P-value is on, P-values are also computed. The confidence degree is determined by the global _rmn_Conflevel, for which the default is .95. Finally, weighing can be done by the global __weight

Back To Top


FiellerInterval

computes the Fieller Interval for the ratio of the mean of a bivariate normal distribution. Its syntax is
out = FiellerInterval(means,cov,df,lbl);
Here means is a vector with two estimated means, and cov the covariance matrix of these estimates, df the degrees of freedom and lbl a string specifying a name for the ratio in an output. The output vector contains mean ratio with confidence limits. Confidence limit is controlled by the global _rmn_Conflevel.

Back To Top


Correlation

computes the correlation coefficient from a Nx2-data matrix data:
out = correlation(data);
By default a standard Pearson correlation coefficient with approximative confidence limits is computed, together with the P-value corresponding to the hypothesis no correlation. With the option Spearman in effect, data is ranked within column before the analysis is done. This is the non-parametric spearman correlation coefficient.

Back To Top


BinomialProportions

takes a 2-by-2-table and considers the rows to be frequencies associated with two independent binomial distributions. Various combinations of the two binomial proportions with confidence intervals based on the likelihood: the difference, the ratio (relative risk), the odds ratio and the excess risk. The syntax is
stat = BinomialProportions(table);
Confidence degree can be specified by _rmn_ConfLevel.

Example:
The code

let table = 14 43 6 67;
call BinomialProportions(table);
produces the following output:
  Parameter         estimate     95% confidence limits  P-value
---------------------------------------------------------------
  Proportion 1         0.246         0.147,   0.367               
  Proportion 2        0.0822        0.0335,    0.16               
  Difference           0.163        0.0386,   0.297               
  Excess Risk          0.178        0.0443,   0.317               
  Relative Risk         2.99          1.29,       8               
  Odds Ratio            3.64          1.35,    10.9       0.01
Note that there is an iterative algorithm involved, so the proc may take time (or even bog down).

Back To Top


MantelHaenzelOR

This proc estimates a stratified odds ratio from a sequence of 2-by-2-tables. Computes MantelHaenzels estimate with confidence limits and the ML estimate with exact confidence limits. As default Breslow-Days test for homogeneity is given, though the user can request Zelens test instead for small tables. The syntax is
qout = MantelHaenszelOR(dataset,dep,indep,covars);
There are basically two different ways of inputting data
  • The first is to give a sequence of 2-by-2-tables in a Nx4-matrix, where the row entries should be a~b~c~d. With this choice all other entries should be 0.
  • In the other case dataset is either a matrix in memory and a dataset on disk. In that case dep should be the variable defining rows (case/control variable) and indep define columns (exposure variable). Stratification is defined by the variables covars. We can weight the observations by the global __weight.
There are two options available:
  • Exact - which computes exact confidence limits for the Mantel-Haenzel estimator of the common odds,
  • Zelen - which means that the exact Zelens test is used as a test for homogeneity, instead of Breslow-Days asymptotic test. This option is memory consuming and perhaps only feasible for small data.
Here are a two examples:

Example 1:
The following code

AddOption("Exact");  @ We want exact CIs @
data = {2 8 10 80, 4 6 15 75, 3 12 5 80, 5 10 10 75, 6 9 5 80};
call MantelHaenszelOR(data,0,0,0);
computes a common odds for 5 2-by-2-tables and gives the following output:
 Individual Odds Ratios:

   Strata             a      b      c      d         OR
 --------------------------------------------------------
      1               2      8     10     80           2
      2               4      6     15     75        3.33
      3               3     12      5     80           4
      4               5     10     10     75        3.75
      5               6      9      5     80        10.7

Breslow-Days measure of homogeneity:  T = 2.75  (P =  0.6)

 Pooled Estimate:       OR     95% confidence limits
-------------------------------------------------------
Mantel-Haenszel       4.15         2.22,    7.73
Exact (ML-estimate)    4.1         2.18,    7.62

Two-sided P-value for OR = 1.00:P = 2.53E-005 (Exact,mid-P)
Note that in this case confidence limits for Mantel-Haenzels estimate is exact. The ML-estimate is always 'exact' in the likelihood sens.

Example 2:
This example, which investigates the relation of Myocardial Infarction to Recent Oral Contraceptive use, refers to a data matrix with the following variables:

 1-SMOKING  2-Use of UC  3-Age Group  4-Group (1=MI/2=Ctl) 
 5-Number 
A traditional Fisher exact test (one table only) can be done as follows:
__weight=5;
call MantelHaenszelOR(data,2,4,0);
which gives the following result

 Estimate:              OR     95% confidence limits
-------------------------------------------------------
Mantel-Haenszel       1.68          1.1,    2.58
Exact (ML-estimate)   1.68         1.08,    2.56

Two-sided P-value for OR = 1.00: P = 0.0216 (Exact,mid-P)
A stratified analysis can be done as follows:
let lbls = "25-29" "30-34" "35-39" "40-44" "45-49";
data[.,3] = recode(data[.,3],data[.,3] .== seqa(1,1,5)',lbls);
let string _rmn_tab_replace = {2  "None@1-24@>=25"};

let string _FactorNames = "Age" "Smoking";
call MantelHaenszelOR(data,2,4,-3|1);
Note that the second covariate here (smoking) has three numeric classes which are recoded by the global _rmn_tab_replace, whereas for the age class variable we choose to actually recode it from numeric data to character data. The output is as follows:
 Individual Odds Ratios:

       Age   Smoking      a      b      c      d         OR
  ------------------------------------------------------------
     25-29      None      0      1     25    106           0
     25-29      1-24      1      0     25     79        +INF
     25-29      >=25      3      1     12     39        9.75
     30-34      None      0      0     13    175           .
     30-34      1-24      1      5     10    142        2.84
     30-34      >=25      8      7     10     73        8.34
     35-39      None      0      3      8    153           0
     35-39      1-24      1     11     11    119       0.983
     35-39      >=25      3     19      7     58        1.31
     40-44      None      1     10      4    165        4.13
     40-44      1-24      0     21      4    130           0
     40-44      >=25      5     34      1     67        9.85
     45-49      None      3     20      2    155        11.6
     45-49      1-24      0     42      1     96           0
     45-49      >=25      3     31      2     50        2.42

Breslow-Days measure of homogeneity:  T = 15.1  (P = 0.299)

 Pooled Estimate:       OR     95% confidence limits
-------------------------------------------------------
Mantel-Haenszel       3.33         1.96,    5.65
Exact (ML-estimate)   3.26         1.92,    5.48

Two-sided P-value for OR = 1.00:P = 2.01E-005 (Exact,mid-P)
Again, exact confidence limits and P-values are computed.

Back To Top


Comments and suggestions, please mail: Anders Källén
Last modified: 98-09-12 9:00