|
|
Data
Manipulation Procedures in Riemann

Since I most often work with data matrices in memory, the two
most important procedures for data restruction are:
- RESTRUCT - which is a sort of transposing within groups
- MERGE - which merges two data matrices by key variables
These two procs were more or less what made me 'get going'
with Gauss. It would be nice to extend them to handle data on
disk as well. Such datasets would need to be sorted by key
variables. To do this the following proc is provided, but yet not
tested on a larger scale
- DISKSORT - sorting dataset on disk on, possibly multiple,
key variables. Provided the Gauss convention on naming of
numeric and character data is followed. It requires that
all key variable data can be read into memory.
- DATASTACK - is another not much tested procedure
operating on data on disk. It stacks datasets on top of
each other.
There are essentially only two procedures within Riemann which
manipulates datasets on disk that we have used to any extent.
These are
- BYTRANSFORMTODISK - this is a very versatile proc that
can do most things with datasets on disk, even though it
might by inefficient at times. The idea is that
there is one grouping (key) variable by which the dataset
is sorted. Then you define a function which tells you how
to manipulate your data. What ByTransformToDisk does, is
that each group is sent to the function, manipulated and
the result is sent to the output dataset. A typical
example is if you have multiple observations over time
for a number of patients, and want to compute the mean
value for each patient and save these on a new dataset.
An extension to more than one key variable might be
useful.
- SELECTFROMDISK - which reads into a memory data matrix
selected data from a dataset on disk. Selection can be a
subset of variables and/or rows selected on values of one
selection variable in the dataset.
I also include in Riemann, at least for the time being, a
somewhat special proc. One problem with Gauss is that character
fields can only be 8 characters long. This is in contrast to e.g.
SAS datasets. When doing tabulations of special kinds (like
disease history or medication history) there has been a need to
bypass this restriction. The solution I have used is to save such
information in an ASCII file with entries separated by @ or by ,
(one or the other!) and then use
- MAKESTRINGARRAY - which converts this file into a string
array, saved with extension .fst in the current library.
This procedure is more complicated than should be needed. This
is because when I extract data from a database, ATOG is used to
convert into Gauss datasets. And there are a few 'bugs' in ATOG
that makes this more complicated than needed. A revision of ATOG
might simplify this proc somewhat.
There are a few more procs relating to data manipulation:
- INPROC - a simple proc which produced a boolean vector of
0 and 1:s if s specified vector takes one of a list of
values.
- CONCATENATE - the same as ~, but pads with missing if
vectors don't have the same length.
- REPLACEINSTRING - scans a string or string array and
replaces any occurrance of a specified string with
another specified string.
- UNIQUEMC - an extension of UNIQUE that computes the key
combinations for more than one key variable.
- STRINGSORT - sort a string array on specific columns
- F2CV - an analoge to FTOCV which handles missing the way
I want
- CV2F - an analoge of STOF which handles missings as I
want and also handles string arrays.
- CV2S - convert a character vector into a string, with
insertion of a specified substring between elements
- SA2STR - converts a string array to a string, where elements
are separated by a user-provided string.
- STR2SA - a converse to SA2STR which converts a string into
a string array, where elements and rows are defined by special
substrings.
The last three of these represent extensions to Gauss I
would like to see incorporated in the appropriate Gauss procs,
making my procs superfluous. Here is a list of other, similar,
extensions that might be useful:
- TRIMR2 - the same as TRIMR but returns a missing of one
tries to trim more rows than available.
- LOADD2 - the same as LOADD, except that it first checks
if there is enought room to read in data. If not, it
returns the name of the dataset instead of a loaded data
matrix.
- INVPD2 - inverts PD matrices with much smaller
eigenvalues than INVPD
- CHOL2 - a similar extension of CHOL

Comments, bug reports and suggestions to: Anders Källén
Last modified: 98-09-12 9:00
|