|
Basic Descriptive and Inferential Statistics in Riemann ![]() Descriptive and elementary inferential statistics is done by a number of procs in Riemann that are described here, together with short examples of their usage.
A few notations are common to all procs: Descriptive statistics:
Elementary statistics of the mean:
Other basic statistical methods:
ameanc, gmeanc, mediancLetx be a nxp-matrix. As such it represents p different
variables with up to n observations per variable (missings are allowed).
The command
means = ameanc(x);produces the p column arithmetic means, ignoring missings. Similarily for gmeanc (geometric mean instead) and medianc
(median).
FrequencyThe syntax of this proc is
{group,freqs,nms} = Frequency(dataset,vars,cvars);
and computes the number of occurrances of values in the columns of
dataset specified by vars. It can be done
by subgroups if these are specified by cvars. The dataset
can be both a data matrix in memory, in which case variables must be
specified by column numbers (negative for character data), or a Gauss dataset
on disk, in which variable names can be used.
The output is as follows:
__weight. This can be
either a column number/name or a vector defining individual row weights.
DescStatThe syntax of this proc is
{group,n,mean,std,min,max} = DescStat(dataset,vars,cvars);
The names of the left hand side defines what parameter is computed.
Each of them is GxP, where G is the number of subgroups defined by
cvars and P the number of variables. dataset
can be either a data matrix in memory or a Gauss dataset on disk. Variable
specification is either by column numbers or variable names. The output
variable group contains a row for each subgroup and
each row is the value combination from the variables in cvars
defining a subgroup. Negative numbers should be used to specify a
column of a data matrix containing character data.
Rows can be weighted using the global
The option
QuantilesThis proc is similar to built-in ones. It's syntax isy = Quantiles(dataset,vars,cvars,e);which computes quantiles of a dataset specified by vars.
What quantiles to compute is specified by the column vector e,
which should contain numbers between 0 and 1 (e.g., 0.5 is the median)
Computations can be done
by subgroups if these are specified by cvars. The dataset
can be both a data matrix in memory, in which case variables must be
specified by column numbers (negative for character data), or a Gauss dataset
on disk, in which variable names can be used.
Rows can be weighted using the global
This proc uses the empirical cumulative distribution function, a step function,
when computing the quantiles (so that the median is obtained in the standard
way). However, the option
SummaryStatisticsis the proc to use if you want to print a summary of statistics. It calls onDescStat and Quantiles. It's syntax is
out = SummaryStatistics(dataset,vars,cvars,e);The input is the same as for Quantiles, and the output matrix contains all requested statistics. However the main usage of this proc is to get printed summaries, which is done by having the global __output on. Weighing can be done by the global __weight.
If the option LOG is in effect, geometric means and
coefficient of variations (as percentage) will be computed instead of arithmetic
mean and standard deviation.
MeanCIcomputes confidence limits and (optionally) P-values for the mean, possibly within subgroups (defined by the variables incvars). Its syntax is
out = MeanCI(dataset,vars,cvars);A printed result is obtained by having __output on. The output
is a matrix, which might need some work to identify its contents (this is
left to the user). If the option LOG is in effect, confidence
limits are given for the geometric mean. If the option P-value
is on, P-values are also computed. The confidence degree is determined by
the global _rmn_Conflevel, for which the default is .95.
Finally, weighing can be done by the global __weight.
MeanDifferenceCIcomputes confidence intervals and (optionally) P-values for the difference of the mean for two groups. Analysis can be done in subgroups. Its syntax isout = MeanDifferenceCI(dataset,vars,group,cvars);Here the input group defines the group belonging, wheras
possible subgroups within which the comparisons should be done is defined
by cvars.The output
is a matrix, which might need some work to identify its contents (this is
left to the user). If the option LOG is in effect, confidence
limits are given for the geometric mean. If the option P-value
is on, P-values are also computed. The confidence degree is determined by
the global _rmn_Conflevel, for which the default is .95.
Finally, weighing can be done by the global __weight
FiellerIntervalcomputes the Fieller Interval for the ratio of the mean of a bivariate normal distribution. Its syntax isout = FiellerInterval(means,cov,df,lbl);Here means is a vector with two estimated means, and cov
the covariance matrix of these estimates, df the degrees of freedom
and lbl a string specifying a name for the ratio in an output. The
output vector contains mean ratio with confidence limits. Confidence limit is
controlled by the global _rmn_Conflevel.
Correlationcomputes the correlation coefficient from a Nx2-data matrixdata:
out = correlation(data);By default a standard Pearson correlation coefficient with approximative confidence limits is computed, together with the P-value corresponding to the hypothesis no correlation. With the option Spearman in effect, data is ranked
within column before the analysis is done. This is the non-parametric spearman
correlation coefficient.
BinomialProportionstakes a 2-by-2-table and considers the rows to be frequencies associated with two independent binomial distributions. Various combinations of the two binomial proportions with confidence intervals based on the likelihood: the difference, the ratio (relative risk), the odds ratio and the excess risk. The syntax isstat = BinomialProportions(table);Confidence degree can be specified by _rmn_ConfLevel.
Example: let table = 14 43 6 67; call BinomialProportions(table);produces the following output: Parameter estimate 95% confidence limits P-value --------------------------------------------------------------- Proportion 1 0.246 0.147, 0.367 Proportion 2 0.0822 0.0335, 0.16 Difference 0.163 0.0386, 0.297 Excess Risk 0.178 0.0443, 0.317 Relative Risk 2.99 1.29, 8 Odds Ratio 3.64 1.35, 10.9 0.01Note that there is an iterative algorithm involved, so the proc may take time (or even bog down).
MantelHaenzelORThis proc estimates a stratified odds ratio from a sequence of 2-by-2-tables. Computes MantelHaenzels estimate with confidence limits and the ML estimate with exact confidence limits. As default Breslow-Days test for homogeneity is given, though the user can request Zelens test instead for small tables. The syntax isqout = MantelHaenszelOR(dataset,dep,indep,covars);There are basically two different ways of inputting data
Example 1:
AddOption("Exact"); @ We want exact CIs @
data = {2 8 10 80, 4 6 15 75, 3 12 5 80, 5 10 10 75, 6 9 5 80};
call MantelHaenszelOR(data,0,0,0);
computes a common odds for 5 2-by-2-tables and gives the following output:
Individual Odds Ratios:
Strata a b c d OR
--------------------------------------------------------
1 2 8 10 80 2
2 4 6 15 75 3.33
3 3 12 5 80 4
4 5 10 10 75 3.75
5 6 9 5 80 10.7
Breslow-Days measure of homogeneity: T = 2.75 (P = 0.6)
Pooled Estimate: OR 95% confidence limits
-------------------------------------------------------
Mantel-Haenszel 4.15 2.22, 7.73
Exact (ML-estimate) 4.1 2.18, 7.62
Two-sided P-value for OR = 1.00:P = 2.53E-005 (Exact,mid-P)
Note that in this case confidence limits for Mantel-Haenzels estimate is
exact. The ML-estimate is always 'exact' in the likelihood sens.
Example 2: 1-SMOKING 2-Use of UC 3-Age Group 4-Group (1=MI/2=Ctl) 5-NumberA traditional Fisher exact test (one table only) can be done as follows: __weight=5; call MantelHaenszelOR(data,2,4,0);which gives the following result Estimate: OR 95% confidence limits ------------------------------------------------------- Mantel-Haenszel 1.68 1.1, 2.58 Exact (ML-estimate) 1.68 1.08, 2.56 Two-sided P-value for OR = 1.00: P = 0.0216 (Exact,mid-P)A stratified analysis can be done as follows:
let lbls = "25-29" "30-34" "35-39" "40-44" "45-49";
data[.,3] = recode(data[.,3],data[.,3] .== seqa(1,1,5)',lbls);
let string _rmn_tab_replace = {2 "None@1-24@>=25"};
let string _FactorNames = "Age" "Smoking";
call MantelHaenszelOR(data,2,4,-3|1);
Note that the second covariate here (smoking) has three numeric
classes which are recoded by the global _rmn_tab_replace,
whereas for the age class variable we choose to actually recode
it from numeric data to character data. The output is as follows:
Individual Odds Ratios:
Age Smoking a b c d OR
------------------------------------------------------------
25-29 None 0 1 25 106 0
25-29 1-24 1 0 25 79 +INF
25-29 >=25 3 1 12 39 9.75
30-34 None 0 0 13 175 .
30-34 1-24 1 5 10 142 2.84
30-34 >=25 8 7 10 73 8.34
35-39 None 0 3 8 153 0
35-39 1-24 1 11 11 119 0.983
35-39 >=25 3 19 7 58 1.31
40-44 None 1 10 4 165 4.13
40-44 1-24 0 21 4 130 0
40-44 >=25 5 34 1 67 9.85
45-49 None 3 20 2 155 11.6
45-49 1-24 0 42 1 96 0
45-49 >=25 3 31 2 50 2.42
Breslow-Days measure of homogeneity: T = 15.1 (P = 0.299)
Pooled Estimate: OR 95% confidence limits
-------------------------------------------------------
Mantel-Haenszel 3.33 1.96, 5.65
Exact (ML-estimate) 3.26 1.92, 5.48
Two-sided P-value for OR = 1.00:P = 2.01E-005 (Exact,mid-P)
Again, exact confidence limits and P-values are computed.
Comments and suggestions, please mail: Anders Källén |