Back to Navigation Page

Data Manipulation Procedures in Riemann


Since I most often work with data matrices in memory, the two most important procedures for data restruction are:

  • RESTRUCT - which is a sort of transposing within groups
  • MERGE - which merges two data matrices by key variables

These two procs were more or less what made me 'get going' with Gauss. It would be nice to extend them to handle data on disk as well. Such datasets would need to be sorted by key variables. To do this the following proc is provided, but yet not tested on a larger scale

  • DISKSORT - sorting dataset on disk on, possibly multiple, key variables. Provided the Gauss convention on naming of numeric and character data is followed. It requires that all key variable data can be read into memory.
  • DATASTACK - is another not much tested procedure operating on data on disk. It stacks datasets on top of each other.

There are essentially only two procedures within Riemann which manipulates datasets on disk that we have used to any extent. These are

  • BYTRANSFORMTODISK - this is a very versatile proc that can do most things with datasets on disk, even though it might by inefficient at times. The idea is that there is one grouping (key) variable by which the dataset is sorted. Then you define a function which tells you how to manipulate your data. What ByTransformToDisk does, is that each group is sent to the function, manipulated and the result is sent to the output dataset. A typical example is if you have multiple observations over time for a number of patients, and want to compute the mean value for each patient and save these on a new dataset. An extension to more than one key variable might be useful.
  • SELECTFROMDISK - which reads into a memory data matrix selected data from a dataset on disk. Selection can be a subset of variables and/or rows selected on values of one selection variable in the dataset.

I also include in Riemann, at least for the time being, a somewhat special proc. One problem with Gauss is that character fields can only be 8 characters long. This is in contrast to e.g. SAS datasets. When doing tabulations of special kinds (like disease history or medication history) there has been a need to bypass this restriction. The solution I have used is to save such information in an ASCII file with entries separated by @ or by , (one or the other!) and then use

  • MAKESTRINGARRAY - which converts this file into a string array, saved with extension .fst in the current library.

This procedure is more complicated than should be needed. This is because when I extract data from a database, ATOG is used to convert into Gauss datasets. And there are a few 'bugs' in ATOG that makes this more complicated than needed. A revision of ATOG might simplify this proc somewhat.

There are a few more procs relating to data manipulation:

  • INPROC - a simple proc which produced a boolean vector of 0 and 1:s if s specified vector takes one of a list of values.
  • CONCATENATE - the same as ~, but pads with missing if vectors don't have the same length.
  • REPLACEINSTRING - scans a string or string array and replaces any occurrance of a specified string with another specified string.
  • UNIQUEMC - an extension of UNIQUE that computes the key combinations for more than one key variable.
  • STRINGSORT - sort a string array on specific columns
  • F2CV - an analoge to FTOCV which handles missing the way I want
  • CV2F - an analoge of STOF which handles missings as I want and also handles string arrays.
  • CV2S - convert a character vector into a string, with insertion of a specified substring between elements
  • SA2STR - converts a string array to a string, where elements are separated by a user-provided string.
  • STR2SA - a converse to SA2STR which converts a string into a string array, where elements and rows are defined by special substrings.

The last three of these represent extensions to Gauss I would like to see incorporated in the appropriate Gauss procs, making my procs superfluous. Here is a list of other, similar, extensions that might be useful:

  • TRIMR2 - the same as TRIMR but returns a missing of one tries to trim more rows than available.
  • LOADD2 - the same as LOADD, except that it first checks if there is enought room to read in data. If not, it returns the name of the dataset instead of a loaded data matrix.
  • INVPD2 - inverts PD matrices with much smaller eigenvalues than INVPD
  • CHOL2 - a similar extension of CHOL

Comments, bug reports and suggestions to: Anders Källén
Last modified: 98-09-12 9:00