GAS Manual

--{- General User Guide v2.3 -}--

(c) Alan Young, 1993-1998.

# Introduction

This is the general manual for the Genetic Analysis System version 2.3. It assumes you have already read the introductory notes. The analysis modules are described in a companion volume. To supplement this manual there is a series of demonstration examples available as computer files, and it is highly recommended that these are used in conjunction with the text.

## Contents

Chapters: Appendices:
• Control Variables
• Installing gas

# Running the Program

The gas program is controlled by giving it a list of commands contained in a file called a 'gasfile' (this is analogous to a 'BAT' file on PCs, or a 'COM' file on VAX systems). These commands describe where to look for data and what to do with it.

## Order of Input

The program requires that the following data be entered:

1. Locus Specifications
2. Control Variables
3. Pedigree Specifications
4. Modifications required
5. Analysis Routines to be used

Comments may be placed in any input file by preceding them with an exclamation mark (!), which causes the program to ignore the remainder of the current input line. The letter x is usually used to mark where data is unknown. Individual program commands may be written over several lines, and should end with a semi-colon (;).
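The comment and semi-colon rules can be illustrated with a short Python sketch. This is not part of gas: the function name `split_commands` is hypothetical, and the sketch assumes that neither `!` nor `;` appears inside a quoted file name.

```python
def split_commands(text):
    """Return the commands in `text`, with '!' comments removed."""
    # Drop everything from '!' to the end of each input line.
    stripped = [line.split("!", 1)[0] for line in text.splitlines()]
    # Commands may span several lines, so join before splitting on ';'.
    joined = " ".join(stripped)
    # Normalise whitespace inside each command and skip empty pieces.
    return [" ".join(part.split()) for part in joined.split(";") if part.strip()]


example = """
set logfile = basic.log;   ! open a logfile
read( data
      basic.loc );         ! a command split over two lines
"""
commands = split_commands(example)
print(commands)
```

The second command is recovered as a single item even though it was written over two lines.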

### Locus Specification

This is the specification and labelling of the loci types and other variables which have been measured for individuals in the pedigree. This is described in Chapter 4.

### Control Variables

These variables modify the performance and output of the program and are altered using the set command. They are described in Chapters 6 and 8, with a summary given in Appendix C.

### Pedigree Specification

For each subject there must be an entry indicating to which family they belong, their parents, sex, and what is known about their genotypes and phenotypes. This is described in Chapter 5.

### Modifications Required

gas can modify the contents of a dataset to make it suitable for different purposes. The methods for doing this are described in Chapters 9, 10 and 11.

### Analysis Routines

This is a list of commands telling the gas program what sort of analysis to perform on the data. These are described in a companion document to this manual.

## On-line Help

gas has a basic help facility built into the program. Typing
gas help
will give a list of the main commands used by the program. When running the program, a reminder of the parameters for most of the commands and routines can be obtained by replacing their normal argument with help. For instance set help; and call sibdes( help ); will respond by listing the parameters that may be used with the set and sibdes commands respectively.

#### Example:

Below is a basic gasfile which loads in a dataset and runs a sib-pair analysis on it. The file is called basic.gas and you can run it by typing gas basic at the command line of your computer.

   set logfile = basic.log;                  !  1
   set outfile = basic.out;                  !  2
   read( data basic.loc );                   !  3
   read( pedigree basic.ped );               !  4
   program;                                  !  5
   call sibdes( locus atopy mk1 mk2 );       !  6
   stop;                                     !  7


The lines in this control file perform the following actions:

1. A logfile is opened. Any messages generated by gas will now be copied to the file basic.log.
2. An outfile is opened. The output from the following sib-pair analysis will be sent to the file basic.out.
3. Data describing the loci involved is read from the file basic.loc.
4. Data on the pedigree structure is loaded from the file basic.ped.
5. The program command tells gas to expect a list of analytical operations to be performed.
6. The sibdes routine is called to perform an IBD sib-pair analysis on the loci atopy, mk1 and mk2 (which are described in the data files).
7. The stop command marks the end of instructions to gas.

After running this example look at the files it used (basic.loc and basic.ped) and created (basic.log and basic.out). As with all the examples in the manual, feel free to modify any of the files to see how the output changes - if you enter anything that doesn't make sense to the gas program, it will tell you where to look in the input files for mistakes (the filename and line-number will appear in the logfile).

# Data Input and Output

The first stage in any analysis is to load data into the program using the read command.

gas is able to read locus and pedigree data in two main formats. The preferred format is a keyword-based system, called 'g-format'. An alternative is that used in the LINKAGE suite of programs, denoted by 'l-format'. Some of the exercises later in this manual show how to convert data between these. To load data into gas use the read command thus:

read( file_type file_name(s) );

where file_type describes the type of data being loaded (the options are listed below) and file_name(s) is a list of one or more files from which data is to be taken.

| file type | contents |
|---|---|
| data | locus specifications in g-format |
| ldata | locus specifications in l-format |
| pedigree | family data in g-format |
| lpedigree | family data in l-format |
| alsize | family data using CA-repeats |

Hence the command

read( pedigree bigfam.ped );

will read family data in g-format from the file bigfam.ped. If you wish to read and/or write in a directory other than the current one, then you must enclose the path and file name in quotation marks, thus

read( pedigree "../raw/bigfam.ped" );

would (under unix) direct gas to go up one directory level then look in the sub-directory raw for the file bigfam.ped.

The pedigree and lpedigree formats are discussed in Chapter 5, and the alsize format is discussed in Chapter 7.

## Writing Data

The write command may be used to send locus and pedigree data held inside gas into files for long-term storage or processing by other programs. The basic format is

write( file_type file_name );

where file_type is one of:

| file type | contents |
|---|---|
| data | locus specifications in g-format |
| ldata | locus specifications in l-format |
| pedigree | family data in g-format |
| lpedigree | family data in l-format |
| program | analysis commands |

and file_name is the name of the file into which the data is put. For the first four file types, further control of the output is available using the locus parameter thus

write( file_type file_name locus locus_name(s) );

which causes only those loci whose names are listed to appear in the new file (subsets of data may also be written using the edit, select and delete features described later).

#### * Example *

The gasfile io.gas reads l-format data from the files io.dat and io.fam. It writes all of it in g-format to the files io1loc.new and io1ped.new, then only those loci corresponding to phenotypic data are written to the files io2loc.new and io2ped.new.

### Nested Files

Because an input file of type data is simply a list of gas instructions, it may be used to load further files, and these files may also load further files, etc. The include command may also be used for this - the syntax being

include filename;

A convenient way to organise the input files is to keep the genotype specification, the pedigree data and the program commands in three separate files (eg. locs.gas, fam.gas and prog.gas respectively), each of which may load further files. The read command can then be used in the main file to input the contents of each of these separate files.

   read( data locs.gas );


Using this hierarchical organisation, only the prog.gas file would need to be altered to perform different operations on the data. Similarly, any pre-written commands could also be stored separately. Schematically this organisation looks like

| top-level | sub-file | sub-file 2 | description |
|---|---|---|---|
| main.gas | locs.gas | | genotype specifications |
| | fam.gas | fam1.ped | 1st family |
| | | famN.ped | Nth family |
| | prog.gas | alpmain.gas | main program commands |
| | | alpsub.gas | subroutine commands |

Alternatively, since the 'top' three files are all gasfiles, the file main.gas could be omitted and the same effect achieved by typing

gas locs fam prog

on the command line.

# Describing Loci

Before any data on individual subjects can be loaded into gas it is necessary to specify what the data will consist of, i.e. the names and types of variables which have been measured. Usually this will consist of a list of marker loci together with one or more hypothesized phenotype-influencing loci thought to be relevant to the genetic situation.

The specifications for each of the loci types (affection, binary, quantitative and named) follow a common pattern:

   set locus locus_name
type_of_locus
number_of_alleles  gene_frequencies ...
further_specifications...;


Note that in your input file, although the parameters must appear in the same order as listed here, they can be split differently across lines of the file.

The newlocus command can be used to construct additional loci with values calculated according to various criteria - see Chapter 11 for more details on this.

## General Parameters

These parameters may be specified for all types of locus.

If the sexlinked qualifier is given after a particular locus, then the data for that locus is assumed to be sex-linked, and it may be necessary to enter additional parameters to describe this behaviour.

### Eqfreq

The eqfreq qualifier may be used when the number of alleles is known, but their frequencies are irrelevant. For instance

8 eqfreq

is equivalent to entering

8 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
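As a quick check of this equivalence, the following Python sketch expands an allele count into the equal frequencies that eqfreq stands for. The function name `expand_eqfreq` is hypothetical and is not a gas command.

```python
def expand_eqfreq(number_of_alleles):
    """Return the equal gene frequencies implied by 'N eqfreq'."""
    return [1.0 / number_of_alleles] * number_of_alleles


frequencies = expand_eqfreq(8)
print(frequencies)
```

For 8 alleles each frequency is 1/8 = 0.125, matching the expanded line above.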

## Affection

Affection loci are used to describe a simple phenotype in which a subject is either affected or non-affected by some condition. Subjects may be divided into a series of liability classes corresponding to distinct groups within a population, within which the penetrance of the phenotype associated with the locus is assumed to act differently. The format required to specify an affection locus is:

   set locus locus_name affection
number_of_alleles gene_frequencies...
number_of_liability_classes
name_of_liability_class_1   penetrances...
.
name_of_liability_class_N   penetrances...


If the locus has only one liability class then its name is omitted from the specification. In cases where the liability classes and penetrances are irrelevant to the type of analysis being performed (currently only the dissect and lik... modules use the penetrance data) the specification may be omitted by replacing it with the word noclass, which is equivalent to specifying one class with uniform penetrances of 0.5, thus:

   set locus locus_name affection
number_of_alleles gene_frequencies...
noclass;


### Affection Loci Penetrances

Suppose a locus has n alleles, and label the penetrance corresponding to a subject having alleles i and j as $p_{i,j}$. Then the penetrances can either be entered in triangular order as

   name_of_liability_class
p1,1 p1,2 ... p1,n-1 p1,n
p2,2 p2,3 ... p2,n
p3,3 ...
.
pn,n


or alternatively in full square format in the order

   name_of_liability_class
p1,1 p1,2 ... p1,n-1 p1,n
p2,1 p2,2 ... p2,n-1 p2,n
.
pn,1 pn,2 ... pn,n-1 pn,n


and gas will automatically deduce which has been used by counting the number of entries given. The square format may be used to describe an imprinted locus when $p_{i,j}$ differs from $p_{j,i}$. In such cases $p_{f,m}$ is taken to be the penetrance when a subject inherits the f allele from its father and the m allele from its mother.
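The counting rule can be sketched in Python. A triangular specification of n alleles has n(n+1)/2 entries and a square one has n x n. The function names are hypothetical, and the genotype order shown assumes the triangular layout lists genotype (i, j) with i <= j, row by row; this illustrates the idea and is not the code used inside gas.

```python
def penetrance_layout(n_alleles, n_values):
    """Deduce the layout of a penetrance list from its length."""
    triangular = n_alleles * (n_alleles + 1) // 2
    square = n_alleles * n_alleles
    if n_values == triangular:
        # With a single allele both layouts have one entry.
        return "triangular"
    if n_values == square:
        return "square"
    raise ValueError(f"expected {triangular} or {square} values, got {n_values}")


def triangular_order(n_alleles):
    """Genotypes (i, j), i <= j, in the assumed triangular entry order."""
    return [(i, j)
            for i in range(1, n_alleles + 1)
            for j in range(i, n_alleles + 1)]


print(penetrance_layout(2, 3), penetrance_layout(2, 4))
print(triangular_order(3))
```

Any other number of entries is rejected here with an error; how gas itself reports a miscounted list is not covered by this sketch.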

For instance, to describe an affection locus, named a1, with 2 alleles (of frequencies 0.75 and 0.25) and 3 non-imprinted liability classes (labelled 1, 2 and 3), enter the following:

   set locus a1 affection
2 0.75  0.25
3
1   0.3  0.79  0.90
2   0.3  0.90  0.95
3   0.5  0.13  0.37;


If this locus had only the first liability class, then its specification would be:

   set locus a1 affection
2  0.75  0.25
1
0.3  0.79  0.90;


If the subject's trait value or liability class is unknown, all the penetrances are taken to be 1.

For sex-linked data a further series of values must be entered for each liability class to describe the penetrances in male subjects (the first data is assumed to refer to females with 2 chromosomes). For each liability class, a line is required of the form

name_of_liability_class p1 p2 ... pn

where pi is the penetrance of the condition in a male subject having the single allele i. Thus for a non-imprinted sex-linked locus having 3 alleles and 2 liability classes, called child and adult, enter the following:

   set locus a1 affection
3 0.35 0.4 0.25
2
child    0.3   0.79  0.90  0.35  0.64  0.90
adult    0.3   0.90  0.95  0.20  0.50  0.64
child    0.3   0.65  0.00


Again, omit the name of the liability class if there is only one. If the noclass qualifier is used, then sexlinked should be entered immediately after it.

## Binary

Binary loci are used to describe a phenotype in terms of the presence or absence of several criteria (called factors) simultaneously. The format required to specify a binary locus is:

   set locus locus_name binary
number_of_alleles  gene_frequencies...
number_of_factors
code_for_allele_1...
.
code_for_allele_N...;


where N is the number of alleles. For instance, to describe a binary locus, named bn1, with 4 alleles and 3 factors, enter the following:

   set locus bn1 binary
4 0.35  0.25  0.1  0.3
3
1  0  0
0  1  0
0  0  1
0  1  1;


Thus the line 1 0 0 specifies that subjects containing a copy of allele 1 will be positive for the first factor but not for the second and third factors (though they may have these latter two due to their other allele).

Hence a subject who is positive for the first and third factors and definitely negative for the second factor must have genotype 1 3. However if their status for the second factor was unknown then the subject might have either genotype 1 3 or 1 4.
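The reasoning above can be reproduced with a small Python sketch. It assumes that a subject is positive for a factor when either allele carries it, which is how the text describes the bn1 example; the function name `compatible_genotypes` is hypothetical and this is not the routine used by gas.

```python
from itertools import combinations_with_replacement


def compatible_genotypes(allele_codes, observed):
    """List genotypes whose combined factors match the observed results.

    allele_codes: one list of 0/1 factor codes per allele.
    observed: 1 (positive), 0 (negative) or None (unknown) per factor.
    """
    genotypes = []
    alleles = range(1, len(allele_codes) + 1)
    for a, b in combinations_with_replacement(alleles, 2):
        # A factor is expressed if either allele carries it.
        expressed = [x | y for x, y in zip(allele_codes[a - 1], allele_codes[b - 1])]
        if all(o is None or o == e for o, e in zip(observed, expressed)):
            genotypes.append((a, b))
    return genotypes


bn1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
print(compatible_genotypes(bn1, [1, 0, 1]))
print(compatible_genotypes(bn1, [1, None, 1]))
```

The first call leaves only genotype 1 3, and the second leaves 1 3 and 1 4, in agreement with the text.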

## Quantitative

Quantitative loci are used to store phenotypic variables which cannot be classified as simply present or absent (for instance, the height of a subject). The format required to specify a quantitative locus is:

   set locus locus_name quantitative
number_of_alleles  gene_frequencies...
number_of_liability_classes
name_of_liability_class_1
penetrance_distributions_for_class_1...
.
name_of_liability_class_N
penetrance_distributions_for_class_N...;


The parameters from number_of_liability_classes onwards are only used by the lik... routines, and if these are not being used they can be replaced by the word noclass when creating a new locus. For instance, to describe a quantitative locus, named qval, with 2 alleles and no liability class information enter the following:

   set locus qval quantitative
2 0.65  0.35
noclass;


### Quantitative Loci Penetrance Distributions

gas has two forms for penetrance distributions (if you don't know what these are, then get statistical advice before using them!) based around constant and normal distributions. To specify that the penetrance associated with a genotype is independent of the quantitative value, use the syntax

constant value

and to specify that a penetrance has a normal distribution (based on the subject's measured trait value) use the syntax

normal mean variance

The penetrances associated with each genotype are entered in the order described above for affection loci penetrances. gas supports multiple liability classes for quantitative traits (the LINKAGE programs support only one class). To describe a quantitative locus, called qval1, with 3 alleles and only one liability class (in which case the class name is omitted) you might enter the following:

   set locus qval1 quantitative
3 0.25  0.4 0.35
1
normal   22.3  5
normal   13.5  2.6
normal   23.1  1.64
constant  0.3
normal    7.3  14.5
constant  0.5;


which shows that the penetrance associated with the genotype 1 2 has a normal distribution with mean 13.5 and variance 2.6, and the penetrance associated with genotype 3 3 has a fixed value of 0.5.
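To make the two forms concrete, the following Python sketch evaluates a penetrance specification for a measured trait value. It assumes that a normal penetrance is the normal density with the stated mean and variance evaluated at the trait value; this is an assumption made for illustration, and the exact scaling used inside gas is not stated in this manual. The function name `penetrance` is hypothetical.

```python
import math


def penetrance(spec, trait_value):
    """Evaluate ('constant', value) or ('normal', mean, variance)."""
    kind = spec[0]
    if kind == "constant":
        # Independent of the measured quantity.
        return spec[1]
    if kind == "normal":
        mean, variance = spec[1], spec[2]
        exponent = -((trait_value - mean) ** 2) / (2.0 * variance)
        return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)
    raise ValueError(f"unknown penetrance form: {kind}")


genotype_1_2 = ("normal", 13.5, 2.6)   # second entry of qval1
genotype_3_3 = ("constant", 0.5)       # last entry of qval1
print(penetrance(genotype_1_2, 13.5), penetrance(genotype_3_3, 13.5))
```

Under this assumption the normal form peaks when the trait value equals the mean, while the constant form returns the same value for every measurement.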

Similarly, to describe a quantitative locus, named qval2, with 2 alleles and 2 liability classes (labelled black and blue) you might enter the following:

   set locus qval2 quantitative
2 0.25 0.75
2
black   constant   0.2
constant   0.2
constant   0.95
blue    normal   -25    12.4
normal     0.0  14
normal    24    12.4;


If the subject's trait value or liability class is unknown, all the penetrances are taken to be 1.

## Named

A named locus is one in which the alleles are definitely known and have individual names (or numbers). This classification is used to describe such things as markers, CA-repeats and l-format numbered loci. The specification for a named locus is:

   set locus locus_name named
number_of_alleles gene_frequencies...
optional_data...;


For instance, the basic way to describe a locus n1 with 6 alleles (individually identified as 1, 2, 3, 4, 5, and 6) is

   set locus n1 named
6 0.1 0.15 0.05 0.4 0.22 0.08;


The optional_data may be the names of the individual alleles (name), their size (size) (used when alleles are distinguished by their length in terms of DNA bases), variation around mean size (range), minimum size (minsize), their maximum size (maxsize), or the sexlinked, nodata and nofreq parameters.

### Allele Names

By default it is assumed that the alleles are labelled from 1 to the maximum number. However, the name parameter may be used to give the alleles other labels. Hence to describe a locus (called gre) in which the alleles are labelled as the first six Greek letters, enter:

set locus gre named 6 0.1 0.15 0.05 0.4 0.22 0.08 name alpha beta gamma delta epsilon zeta;

If the sexlinked parameter is given then the read( pedigree... ); command will only expect one allele name for each male subject rather than the normal two.

### Nodata

The nodata qualifier may be given after the frequencies. In this case no input data for this locus is expected and all subjects have the value set as unknown (x).

This qualifier is useful before read( alsize ... ); to set up a blank locus in every subject which will be filled as new values are read.

### Nofreq

The nofreq qualifier may be used with nodata when the frequency and number of alleles are unknown (generally when loading allele-size data with read( alsize ... );). It is equivalent to specifying two alleles of equal frequency.

### Size and Range

The alleles of some named loci may be distinguished by their physical length along a chromosome, and the gas program can read data in terms of such measurements using read( alsize ... );. If the expected lengths of the alleles are known prior to loading pedigree data, then they may be entered using either the size/range or minsize/maxsize options.

The size qualifier is used to describe the average length of each of the named alleles, and the range qualifier describes the range about each mean value which is likely to be measured. Hence

   set locus  ns named
3 0.3 0.2 0.5
size   120.7  122.5  130.0
range    0.5    0.5    0.8
nodata;


describes a named locus ns with three alleles, the sizes of which are $120.7\pm0.5$, $122.5\pm0.5$ and $130\pm0.8$ respectively. The nodata parameter means that no data will be expected in files read using read( pedigree ... ); or read( lpedigree ... );.
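The size and range values define an interval for each allele, and a measured length can be assigned to the allele whose interval contains it. The Python sketch below illustrates this for the ns locus. The function name `classify_by_size` is hypothetical, and how gas treats lengths that fall outside every interval, or inside more than one, is not described here.

```python
def classify_by_size(length, sizes, ranges):
    """Return the 1-based allele whose size +/- range contains `length`."""
    for allele, (size, spread) in enumerate(zip(sizes, ranges), start=1):
        if size - spread <= length <= size + spread:
            return allele
    return None  # the length matches none of the declared alleles


ns_sizes = [120.7, 122.5, 130.0]
ns_ranges = [0.5, 0.5, 0.8]
print(classify_by_size(122.7, ns_sizes, ns_ranges))
print(classify_by_size(125.0, ns_sizes, ns_ranges))
```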

### Minsize and Maxsize

The minsize and maxsize qualifiers are used to describe the minimum and maximum sizes of each of the alleles present. Hence

   set locus nm named
2        0.3    0.7
name     bigun  littlun
minsize  140.7  122.5
maxsize  141.8  123.9;


describes a locus with 2 alleles, called bigun and littlun which have lengths in the ranges 140.7 to 141.8 and 122.5 to 123.9 respectively.

#### * Example *

The file xloc.gas contains descriptions of each type of locus in g-format, and the file xloc.ped shows an example family corresponding to them. Run gas to write out the locus data in l-format to xlocl.new and xlocped.new. Note the use of edit(lformat) to convert alphabetic labels to numbers, which stores a table of correspondences in the logfile. Warnings will be given for variables in which the information content has to be reduced in order to convert them to l-format.

#### * Example *

The file xlocl.gas reads in the l-format data produced in the previous example, and writes it to the g-format files xlocg.new and xlocgped.new. Compare these files with the input files in the previous example.

# Describing Subjects

After the type of input data has been described, either by entering it directly into the gas control file or else by loading it in using read( data ... ); or read( ldata ... );, the information for the actual subjects must be loaded. This is done by placing it in one or more files and reading them using read( pedigree ... );, read( lpedigree ... ); or read( alsize ... ); (the last loads data in terms of allele sizes, see Chapters 3 and 7). Note that l- and g-format data can be mixed, but not in the same file. The general syntax used to describe a subject is:

ped_name name parent_1 parent_2 sex loci... optional_modifiers...

Note that all of the data for a single subject must appear on the same line of the input file, and that the loci must be listed in the same order they were specified earlier (to get round this latter restriction see the locus parameter in Chapter 7).

## Relationship

The first four entries on each line correspond to:

| column | meaning |
|---|---|
| 1 | the family of which the subject is a member |
| 2 | the name of the subject |
| 3, 4 | the names of the subject's parents |

These names can be any combination (up to 16 characters long) of the letters a-z, numbers 0-9 and the underscore '_' character; however, if they begin with a number then they must be wholly numeric. If a parent is unknown (i.e. not listed elsewhere in the pedigree) then its name should be replaced with x.
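The naming rule can be expressed as a short Python check, which may be handy for screening pedigree files before loading them. The function name `is_valid_name` is hypothetical, and the sketch assumes that only lower-case letters are accepted, since the manual does not say how upper-case letters are treated.

```python
import re

# 1 to 16 characters drawn from a-z, 0-9 and underscore.
_NAME_PATTERN = re.compile(r"[a-z0-9_]{1,16}")


def is_valid_name(name):
    """Check a family or subject name against the stated rule."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if name[0].isdigit():
        # Names beginning with a number must be wholly numeric.
        return name.isdigit()
    return True


for candidate in ["fam_1", "12", "12a", "a" * 17]:
    print(candidate, is_valid_name(candidate))
```

Note that x passes this check as a string, but it is reserved for marking an unknown parent.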

## Sex

The 5th column describes the sex of the subject and is either m for males, or f for females.

## Loci

For each of the loci described in the genotype specification stage there must be a corresponding entry for each subject (unless the nodata qualifier was used, in which case the entry for that locus must be omitted).

### Affection Loci

The format for an affection locus with more than one liability class is:

status liability_class

where status is one of y, n or x. If there is only one liability class this is reduced to

status

For example, suppose locus baldness has classes blonde, auburn and brown, then to say that a subject has brown hair and is affected, the following should be entered:

y brown

For a non-affected person in the same liability class (i.e. brown) the entry is:

n brown

If the status is unknown, the subject is described as:

x brown

### Binary Loci

A binary locus is given in terms of the results for each factor, where each has three possible results: positive (y or 1), negative (n or 0) and unknown/not-tested (x). The format for a binary subject locus with N factors is thus

result1 result2 ... resultN

Hence for a locus called test_result in which factors 1 and 5 are definitely negative, factors 2 and 3 are definitely positive and factor 4 is unknown (either because it was not tested, or the results were ambiguous) enter the following:

n y y x n

or alternatively

0 1 1 x 0

### Quantitative Loci

A quantitative locus is specified in terms of the quantity it measures (and the liability class if more than one) hence the format is:

quantity liability_class

Thus for a locus having value 12.5 for a particular subject and only one liability class the entry would be:

12.5

For a locus having value 41.6 for a particular subject, who is in liability class wide, the entry would be:

41.6 wide

The symbol x is used if the quantitative value or (if there is more than one) the liability class is unknown. Hence

x x

denotes a subject for whom both the value and liability class are unknown.

### Named Loci

A named locus is one in which the individual alleles can be identified by some direct method (see also the section on the alsize format). The format for a named subject locus is:

allele_number_1 allele_number_2

Hence for a locus having alleles 1,...,6 a subject having alleles 3 and 4 would be described as:

3 4

If both the alleles were unknown, the entry would be:

x x

If only one allele was identified definitely (as 5 say), then the entry would be:

5 x

If the alleles were labelled big, small and tiny then a genotype entry might be:

big small

If the locus has been specified as sex-linked, then only one allele should appear in the input pedigree file for male subjects.

## Modifiers

Subjects may be 'tagged' as having special properties by adding particular keywords after their genotypic and phenotypic data.

### Loopbreak

Some types of analysis (e.g. lodscore calculations) require that any loops within a pedigree should be marked, and that a person be selected at which to 'break' them. This can be done using the loopbreak keyword in the subject's description thus:

1 1 6 7 m rest_of_parameters... loopbreak

Alternatively, it may be selected by entering the following:

set loopbreak = family_name subject_name;

where family_name is the name of a family and subject_name is the name of a subject within it.

### Proband

The proband for a pedigree may be selected by placing the word proband after the input data for one of its members, thus

1 1 6 7 m rest_of_parameters... proband

Alternatively, it may be selected by entering the following:

set proband = family_name subject_name;

where family_name is the name of a family and subject_name is the name of a subject within it. If no proband is given then gas chooses an individual in each family so as to maximise the speed of computation.

## Validation and Relation

Once all the data is entered, it is analysed to eliminate the vast majority of common genotyping and data-entry errors. However, some rare cases of inconsistent inheritance may be missed in large extended pedigrees which contain several fully untyped subjects. (A 'catch-all' routine which performed full genotype elimination for highly polymorphic loci in such families could take months to run!)

### Consistency

Pedigrees are checked to ensure that

1. parents are of opposite sex
2. all members of a family are related
3. each child is consistent with its parents
4. each parent is consistent with all its children

Check 2 can be disabled using the command

set checkrelated = n;

which may be useful when calculating population based statistics in which the dataset does not consist of complete families. However certain types of analysis (eg. the lik... routines) cannot be carried out unless all of the subjects of a family are related to each other via members listed in the pedigree files.
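The idea behind check 3 can be sketched in Python for a single autosomal named locus: a child is consistent if one of its alleles could have come from the father and the other from the mother, with x standing for an unknown allele. This is a simplified illustration with a hypothetical function name, not the algorithm used by gas, and it does not cover sex-linked loci.

```python
def child_consistent(child, father, mother):
    """Each genotype is a pair of allele labels; 'x' marks an unknown allele."""

    def could_pass(parent, allele):
        # An unknown allele on either side can never prove inconsistency.
        return allele == "x" or "x" in parent or allele in parent

    first, second = child
    return ((could_pass(father, first) and could_pass(mother, second))
            or (could_pass(father, second) and could_pass(mother, first)))


print(child_consistent(("3", "4"), ("3", "5"), ("4", "4")))
print(child_consistent(("1", "1"), ("1", "2"), ("2", "2")))
```

The first child could have inherited 3 from the father and 4 from the mother; the second cannot be explained because the mother has no allele 1.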

### Loops

Each family is checked for the presence of 'loops', i.e. individuals who are related by more than one pathway of descent (through actions such as incest or multiple marriage).
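One common way to detect such loops, shown here as a hedged Python sketch rather than as the method gas uses, is to build a graph with a node for every subject and every mating, link each mating to both parents and to each child, and look for a cycle. The function name `has_loop` and the input layout are hypothetical; subjects with only one known parent are skipped for simplicity.

```python
def has_loop(pedigree):
    """pedigree maps subject -> (father, mother); None marks an unknown parent."""
    root_of = {}

    def find(node):
        root_of.setdefault(node, node)
        while root_of[node] != node:
            root_of[node] = root_of[root_of[node]]  # path compression
            node = root_of[node]
        return node

    def connect(a, b):
        """Join two nodes; return True if they were already connected."""
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True
        root_of[root_a] = root_b
        return False

    seen_matings = set()
    for child, (father, mother) in pedigree.items():
        if father is None or mother is None:
            continue
        mating = ("mating", father, mother)
        if mating not in seen_matings:
            seen_matings.add(mating)
            if connect(father, mating) or connect(mother, mating):
                return True
        if connect(mating, child):
            return True
    return False


nuclear = {"dad": (None, None), "mum": (None, None),
           "kid1": ("dad", "mum"), "kid2": ("dad", "mum")}
cousins = {"g1": (None, None), "g2": (None, None),
           "a": ("g1", "g2"), "b": ("g1", "g2"),
           "c": (None, None), "d": (None, None),
           "x": ("a", "c"), "y": ("b", "d"), "z": ("x", "y")}
print(has_loop(nuclear), has_loop(cousins))
```

The nuclear family has no loop, while the marriage between the cousins x and y closes a second pathway of descent back to g1 and g2.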

Any complete loop found is listed, together with suggestions for modifying the pedigree by breaking the loop at certain individuals. By default the program will prompt the user for the name of a subject at which to break each loop, however the command

set autoloop = y;

will allow gas to automatically break any loops. If any loops are found then a file loop.gas will be created listing the chosen breakpoints, and this file may be read into gas the next time the program is run by using

after the subject data has been loaded. Loop checking can be disabled by the command

set checkloop = n;

### Unmake

The lpedigree format can read data which has been processed using the makeped program by including the parameter unmake. Hence to read the file pedin.dat which was generated from an l-format file using makeped, enter the following

read( lpedigree pedin.dat unmake );

Note that if the original pedigree contained loops then gas will attempt to merge the subjects which were duplicated by makeped.

## Options for Reading PEDIGREE Data

Two parameters exist which modify the input of family data:

| parameter | description |
|---|---|
| locus | load only the listed loci |
| overwrite | over-write existing data |

The read( pedigree ... ); command can selectively load subsets of data using the locus parameter. For instance, suppose loci alp, bet, gam, del, and eps are entered into the locus specifications, but only data for the first two are stored in the file partial.ped. Then (as explained in the next section, this is not necessary if the g-format pedigree file was previously created by gas) use the command

read( pedigree partial.ped locus alp bet );

which will load in these two loci and leave any values held for the other loci unchanged. The locus parameter is also used if the loci are listed in the pedigree file in a different order to that in which they were specified earlier using the set locus command.

Normally gas will give an error message if data is read in which would erase previously held information (for instance by loading the same subject twice). If overwrite is included in the list of parameters then no message will be given and the first set of data will be replaced by the second.

### Automatic Selection of Loci

A g-format pedigree file which has been produced by gas will contain a descriptive line of the form

pedigree locus locus_name(s)...

gas will read this and use it to determine which loci are to be loaded from the file (unless the locus parameter is used as above, which over-rides this automatic selection). A warning will be given if the pedigree file contains any loci which were not specified earlier using set locus.

#### * Example *

The file autoload.gas reads the pedigree files au_even.ped, au_odd.ped, au_1to5.ped and au_6to10.ped. The combined dataset is written to au_ped.new.

# Control Variables

The behaviour of gas can be altered using a number of control variables, which are modified by using the set command in the input gasfile thus:

set variable_name = new_value;

Most of these variables have default values which are used unless the user tells the program otherwise.

## Output Control

### Logfile

The diagnostic output from the program can be written to a file for examination after the run is finished. This will be especially useful if a number of warnings or errors have been produced during the run. A logfile named filename can be opened by the command

set logfile = filename;

If the filename does not have an extension (in the filename fred.dat the 'dat' part is the extension) then log is assumed. Only one logfile can be created per program run, any subsequent set logfile commands are noted and ignored.

### Outfile

The outfile is used for results produced by the program, and anything produced by the fprintf command. To open an outfile give the command

set outfile = filename;

If no extension is present, then out is assumed. Every time this command is given a new output file is opened (and any previous one is closed). If results are generated before an output file has been opened, then the file gas.out is created and used to store the data.
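The default-extension rule used for the logfile and outfile can be sketched in Python as follows. The function name `with_default_extension` is hypothetical and the sketch only mirrors the rule as described in this manual.

```python
import os.path


def with_default_extension(filename, default):
    """Append `default` as the extension when `filename` has none."""
    _, extension = os.path.splitext(filename)
    return filename if extension else f"{filename}.{default}"


print(with_default_extension("basic", "log"))
print(with_default_extension("fred.dat", "log"))
print(with_default_extension("results", "out"))
```

A name that already carries an extension, such as fred.dat, is left unchanged.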

### Graphics

Some of the routines in gas provide graphical output. To enable this option, open a file using the command

set psfile = filename;

For more details see Chapter 8 on Postscript Graphics later in this manual.

### Allfile

The allfile is equivalent to separately setting logfile, outfile and psfile using the same parameter each time. Thus the command

set allfile = filename;

opens the files filename.log, filename.out, and filename.ps. Note that the file name parameter must not have an extension or dot '.' within it.

### fprintf

The fprintf utility can be used to display messages and put additional text into files after the program command has been given. fprintf uses a syntax based on the ANSI C fprintf routine. The first parameter determines where the message is sent, and must be one (or more) of:

| letter | action |
|---|---|
| o | write to outfile |
| l | write to logfile |
| s | write to screen |

Hence the command

fprintf( os, "\n\nhello, world" );

sends a blank line followed by the text "hello, world" to both the outfile and the screen.
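The way the destination letters combine can be sketched in Python. The mapping follows the table above; the function name `destinations` is hypothetical, and rejecting unknown letters is an assumption about how such input might be handled.

```python
DESTINATIONS = {"o": "outfile", "l": "logfile", "s": "screen"}


def destinations(letters):
    """Translate a destination string such as 'os' into a list of targets."""
    unknown = [letter for letter in letters if letter not in DESTINATIONS]
    if unknown:
        raise ValueError(f"unknown destination letter(s): {unknown}")
    return [DESTINATIONS[letter] for letter in letters]


print(destinations("os"))
```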

### Verbosity

The level of diagnostic information produced by the program (and sent to the screen and logfile) can be modified using the verbosity variable thus

set verbosity = level;

where the level is an integer (ie. a whole number) in the range 0-3, and larger values produce more information. The default value is 1.

#### * Example *

The gasfile fpdemo.gas creates an outfile fpdemo.out and logfile fpdemo.log and selectively sends information to them and the screen.

## Consistency Checking

Several parameters may be used to alter the behaviour/output of the consistency checking routines:

### Autoloop

The autoloop variable determines whether gas automatically breaks any loops it finds. It takes the values y or n, and has default value n so that the user will normally be asked to select subjects for breaking any loops found.

### Checkloop

The checkloop variable determines whether gas checks each family for the presence of loops. It takes the values y or n, and has default value y so that loops will normally be searched for.

### Checkrelated

The checkrelated variable determines whether gas checks that all the members of each family are related. It takes the values y or n, and has default value y so that normally every family will be checked to ensure that all of their members are related.

### Maxerrors

The maxerrors variable determines how many ERROR status messages gas will display before asking if the user wishes to terminate the run. The default value is 5.

### Maxwarnings

The maxwarnings variable determines how many WARNING status messages gas will display before asking if the user wishes to terminate the run. The default value is 5.

## Loci

The sexlinked command can be used to make loci sex-linked (ie. part of the X-chromosome) by default. Note that this only affects loci declared after it has been altered. The syntax is

set sexlinked = y;

Using set sexlinked = n; restores the normal default for any loci declared subsequently.

## Subjects

### loopbreak

The loopbreak command can be used to designate a particular subject as being a suitable point at which to break a loop. The syntax is

set loopbreak = pedigree_name subject_name;

### proband

The proband command can be used to designate a particular subject as being the proband for a pedigree. The syntax is

set proband = pedigree_name subject_name;

### lqunknown

The l-format uses a particular numeric value to denote when the value for a subject at a quantitative locus is unknown (it has the unfortunate default of 0.0). The lqunknown variable can be used to alter this value both for reading and writing files. The syntax is

set lqunknown = value;

so that entering set lqunknown = -99 will cause gas to mark any l-format subject with quantitative value -99 as being unknown, and similarly any unknown value will be written out as -99 when using write( lpedigree );

The categorizing of subjects as having unknown values is done during the read process, so subsequently altering lqunknown will not affect any subject data that has already been loaded. Hence it is possible to read an lpedigree file then change the value of lqunknown and re-write the original file with the new 'unknown' value substituted for the original one.

# Allele-size Data Input

Automated genotyping using fluorescent markers is becoming increasingly common. The gas program is able to read genotypic data given in terms of the lengths of CA-repeats and to process it into a form suitable for further analysis. The command to read allele-size family data is:

read( alsize file_name(s)... locus locus_name(s)... option(s)... );

where file_name(s)... are the names of the files containing data, and locus_name(s)... are the names of the loci to be loaded from them (it is suggested that the real names of marker loci are used rather than some local convention).

gas incorporates two methods for relating measured lengths to the alleles, called 'fixed' and 'adaptive' binning. With fixed binning the user enters the actual known sizes of all of the alleles when the locus is specified and the new data is categorized according to this.

If the allele sizes are unknown then gas uses an adaptive binning strategy which partitions sizes according to their natural clustering. (We say binning is 'global' if a whole population is scored identically, and 'local' if family-based subsets of the population may be scored differently.) Some experimental datasets are not sufficiently clear for adaptive global binning, and in these cases adaptive local binning may be used, in which alleles are scored separately within each family (data produced in this latter fashion can be used for linkage studies but not association). If local binning is necessary, then the effect of random variations in conditions between runs can be minimized by running all of a family simultaneously on the same gel. The graph option may be used to examine the quality of your data: ideally one wishes to see narrow peaks separated by broad empty intervals.

After the data has been read, gas will check it for consistency - ie. that the alleles are sensible, distinct and do not imply illegitimacy. Messages are displayed (and copied to the logfile if you have opened one) which indicate any suspicious data and the action gas has taken to deal with it. All the good data is added to the internal database and may be used in the analysis routines or saved to a file via the write command. To correct errors it is necessary to either modify the input file directly, or else to return to the original machine data and re-generate the input file from there after making corrections.

## Input Format

The CA-repeat genotype data should be placed in files having either the following five column format:

locus_name pedigree subject allele1 allele2

or an equivalent four column format:

locus_name pedigree.subject allele1 allele2

where locus_name is the locus tested. The pedigree and subject names may be separated by either spaces or a decimal point. allele1 and allele2 are the measured lengths of the alleles belonging to the subject at that particular named locus. If data is missing or uncertain, then one or both of the allele entries should be replaced by the unknown marker x. Data for several loci can appear within the same file.
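To make the two layouts concrete, here is a minimal Python sketch of a reader for a single line of an allele-size file. It is not the parser used by gas. It assumes that the unknown marker is the letter x (the convention used elsewhere in this manual), that comments start with an exclamation mark, and that the pedigree name itself contains no dot; the names SizeRecord and parse_size_line are hypothetical.

```python
from typing import NamedTuple, Optional

UNKNOWN = "x"  # assumption: the usual gas marker for unknown data


class SizeRecord(NamedTuple):
    locus: str
    pedigree: str
    subject: str
    allele1: Optional[float]  # None when the length is unknown
    allele2: Optional[float]


def _size(token):
    """Convert one allele field to a length, or None if it is unknown."""
    return None if token.lower() == UNKNOWN else float(token)


def parse_size_line(line):
    """Parse a four or five column line; return None for blank/comment lines."""
    fields = line.split("!", 1)[0].split()  # drop any trailing comment
    if not fields:
        return None
    if len(fields) == 5:
        locus, pedigree, subject, a1, a2 = fields
    elif len(fields) == 4:
        locus, ped_subject, a1, a2 = fields
        pedigree, dot, subject = ped_subject.partition(".")
        if not dot:
            raise ValueError(f"expected pedigree.subject, got {ped_subject!r}")
    else:
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")
    return SizeRecord(locus, pedigree, subject, _size(a1), _size(a2))
```

Both layouts produce the same record, so a file may mix them freely in this sketch; whether gas itself allows mixing within one file is not stated here.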

#### * Example *

The file als.gas reads in pedigree, affection-status and some marker locus data from the g-format file als.ped and combines this with allele size data for locus 'ca1demo' held in the files als1.siz and als2.siz. The data is locally binned and written out to the file alsped.new in l-format.

In addition to the locus specification, there are several optional parameters which alter the criteria by which allele size data is read and binned. Parameters applicable to both global and local binning are:

| parameter | effect |
|-----------|--------|
| graph | display input data in graphical format |
| minsize | alleles shorter than this are rejected, default 10 |
| maxsize | alleles longer than this are rejected, default 1000 |
| overwrite | over-write existing data |
| showspread | give statistical information on input data |

Postscript graphics are available for the barcharts drawn with the graph option.

If overwrite is not included, then any existing data is treated as protected and a warning is given if an attempt to over-write it is made.

## Graphical Barchart

The graph option will generate barcharts showing the distribution of allele lengths in the input data. By default the barcharts are written in 'block' format to the output file; however, if a psfile has been set, then a graphical barchart is drawn and subsidiary barcharts are produced showing the characteristics of the alleles in each of the individual input files.

It is highly recommended that the graph option is used with each set of incoming data to determine the appropriate method of binning - distinct clusters are suitable for global binning (either fixed or adaptive), whereas a continuous spread can only be scored using the local adaptive algorithm. The format of the command is

read( alsize file_names locus locus_names graph n );

where n is the number of lines used to represent one unit of allele length in the 'block format' text output. If n is not entered, then a default value of 4 is used. Note that the graph parameter over-rides all other options, and no binning is performed.

#### * Example *

The file bar.gas contains specifications for 2 loci and reads pedigree data from bin.ped and size data from bin1.siz and bin2.siz. Barcharts are created for loci alpha and beta, and written in block format to the file albar.out and postscript format to albar.ps.

Printing the results (preferably using a small fixed-width font for the outfile) is useful for examining the distribution across the full range of alleles and hence choosing the optimal algorithm.

## Fixed Global Binning

Fixed global binning is automatically used if the lengths of the alleles were specified when the named locus was described (using set locus). This is done (see Chapter 4) using either the size and range qualifiers or the minsize and maxsize qualifiers. The size ranges declared during the set locus stage may be superseded using sizerange, which supersedes the value of range for all alleles.

Adaptive binning is used if the lengths of the alleles were not specified when the locus was described. If global binning is chosen, the data is pre-scanned before binning subjects to determine optimal bins, however this may be over-ridden with the control parameters:

| type | parameter | description |
|------|-----------|-------------|
| optional | diffsize | two alleles differing by more than this are different, default 1.2 |
| | orderfirst | alleles are labelled in the order they are read |
| | samesize | two alleles differing by less than this are the same, default 0.95 |

The adaptive binning method uses the values samesize and diffsize to determine which alleles are the same and which are different, according to the criteria:

| condition | action taken by gas |
|-----------|---------------------|
| dl < samesize | alleles are assumed to be the same |
| samesize <= dl <= diffsize | ambiguous, give a warning message |
| diffsize < dl | alleles are assumed to be different |

where dl is the difference in length between two alleles.
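The three conditions above amount to a simple three-way test on dl. The Python sketch below restates them using the default values of samesize and diffsize quoted in this chapter; it illustrates the published criteria only and is not the binning code inside gas. The function name compare_alleles is hypothetical.

```python
def compare_alleles(length_a, length_b, samesize=0.95, diffsize=1.2):
    """Classify two measured allele lengths as same, ambiguous or different."""
    dl = abs(length_a - length_b)  # difference in length between the alleles
    if dl < samesize:
        return "same"
    if dl > diffsize:
        return "different"
    return "ambiguous"  # samesize <= dl <= diffsize: gas would warn here
```

Widening the gap between samesize and diffsize makes more pairs ambiguous, whereas narrowing it forces more automatic decisions.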

To alter the criteria for local binning, include one or both of the parameters

samesize = a diffsize = b

inside the brackets, where a and b are the desired new values. For instance, the command

read( alsize  gel1.siz gel2.siz  locus d1s20 d1s22
      samesize=0.8  diffsize=1.9  minsize=115
      maxsize=143  orderfirst );


will reject any length shorter than 115 or longer than 143; two alleles less than 0.8 apart will be treated as identical, two alleles more than 1.9 apart will be assumed different, and alleles will be labelled in the order in which they are first encountered when reading the data for each family.

### Ambiguous Alleles

Often it will be impossible for gas to decide exactly how to score every single allele. Whenever such ambiguous alleles are found the user is shown the nearest bin(s), together with the distance of the allele from these bin(s), and then presented with a choice of options. These are:
• Put the ambiguous allele into a nearby bin, which is then stretched to accommodate the allele. Enter the index number of the bin.
• Make a new bin centred near the allele. Enter 'm'.
• Score the allele as unknown. It is marked as x in the pedigree and takes no part in subsequent analysis. Enter 'x'.
• Show the list of bins currently defined. Enter 's'.

### Bin Display

If the 's' option is selected in response to gas finding an ambiguous allele, the list of bins currently defined is displayed in the format:

| Bin | Alleles | Diffsize Limits | Size of Bin | Range of Alleles |
|-----|---------|-----------------|-------------|------------------|
| NB | Na | Dmin - Dmax | Smin - Smax | Rmin - Rmax |

These figures have the following interpretation:

| figure | interpretation |
|--------|----------------|
| NB | the index of the bin |
| Na | the number of alleles currently in the bin |
| Dmin, Dmax | lengths outside this range are put in a different bin |
| Smin, Smax | lengths inside this range are put in bin NB |
| Rmin, Rmax | this is the range of alleles currently in bin NB |

An arrow <- is displayed beside the bin(s) closest to the ambiguous allele, and the user is presented with the list of options again.

In local binning each family is scored separately. This is the most robust method for poor quality data, but (since allele '1' may correspond to different physical lengths in different families) it cannot be used for association tests. Local binning is the default for the adaptive method.

With global binning, the same bins are used for every family in the dataset. The names and sizes of the bins may be saved using write( data ... ); (this is necessary if additional people are to be added at a later date). To use global binning, include the parameter global in the read command.

The first stage in adaptive global binning is to pre-scan the data to locate clusters of alleles. These clusters are marked as bins until it is impossible to find a new one which would have at least 2+N/100 datapoints in it (where N is the total number of datapoints being read). This cutoff point can be altered using the minprebin parameter. If the final results are unsatisfactory (and you've experimented with a few values of samesize and diffsize) then you could try manually entering the allele sizes at the earlier set locus... stage, using the output from the graph command as a guide. If this doesn't work then your only recourse is to switch over to local binning.

### Xambiguous

Using the xambiguous parameter will cause all ambiguous alleles to be marked as unknown x. However this should be used sparingly as it could wipe out a lot of recoverable data.

#### * Example *

The file bin.gas contains specifications for 4 loci and reads pedigree data from bin.ped and size data from bin1.siz and bin2.siz. Locus 'alpha' is binned using the global adaptive algorithm. Locus 'beta' is binned globally using fixed bins (note the specification of the allele lengths in the gasfile) and locus 'gamma' is binned locally. A series of warnings will be produced during the binning process - enter 'c' then press the Return key each time that gas asks if you wish to "Continue or Quit?". The messages are copied into the logfile bin.log and the good data is written to binped.new for inspection. The new bins created for locus 'alpha' are written to the file binloc.new, and may be used for fixed-binning if subsequent data is added. (Note that in this case it may be necessary to add further bins manually if the new dataset contains genuine alleles not present in the initial one. Alternatively both the initial and new size data could be re-loaded simultaneously.)

## Consistency Checking

After the allele-size data has been read, gas will check it for consistency - ie. that the alleles are sensible, distinct and do not imply illegitimacy. Messages are displayed (and copied to the logfile if you have opened one) which indicate any suspicious data and the action gas has taken to deal with it. All the good data is added to the internal database and may be used in the analysis routines or sent to a file via the write command. The following checks are performed:
1. A warning is given if the alsize input file contains any subject not listed in the original pedigree data file(s).
2. A warning is given if the allele lengths are outside of normal limits (the default valid range is 20 to 1000bp).
3. After labelling the alleles a test is performed to check that each child has a genotype consistent with that of their parents and any full siblings. If an inconsistency is found, the locus data for the whole family containing it is labelled as being unknown x.
4. A warning is issued if the new data will over-write previously loaded data.
5. With fixed global binning, a warning is signalled if any allele length lies outside the valid known ranges for the alleles. In this case the allele is marked as unknown.

Note from (3) that if even a single bad inheritance is detected within a pedigree then ALL of the members of that pedigree are marked as unknown 'x' at that locus - the data for the other members will not be accepted until the user has resolved the bad inheritance manually. To correct such errors it is necessary to either modify the input file directly, or else to return to the original machine data and re-generate the input file from there after making corrections.
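The parent-child part of check (3) can be illustrated with a short Python sketch. It assumes an autosomal locus, treats the unknown marker x as compatible with any allele, and does not attempt the full-sibling comparison that gas also performs; it is a generic Mendelian test, not the routine used by gas. The function name child_is_consistent is hypothetical.

```python
UNKNOWN = "x"  # marker for an unknown allele


def child_is_consistent(child, father, mother):
    """Return True if the child's two alleles could have come one from
    each parent. Genotypes are pairs of allele labels, e.g. ("1", "3")."""

    def could_pass(parent, allele):
        # An unknown allele, or a parent with an unknown allele, never
        # provides evidence of an inconsistency.
        return allele == UNKNOWN or UNKNOWN in parent or allele in parent

    first, second = child
    return (could_pass(father, first) and could_pass(mother, second)) or (
        could_pass(father, second) and could_pass(mother, first)
    )
```

A family in which any child fails this test would, following the rule above, have the locus set to unknown for every member until the problem is resolved.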

# Postscript Graphics

Some of the routines in gas are able to display their results graphically. This is done by creating a Postscript file which can be viewed (for instance using the 'ghostview' previewer, which is available freely via FTP) or printed (using a Postscript compatible laserprinter) after the gas run has completed. Routines which provide graphical output are marked with the psgraphics symbol.

## Output File

The graphical option is selected by specifying a file using set psfile, thus

set psfile = filename.ext;

would direct the graphical output into the file filename.ext. By default, postscript files produced by gas will be given the extension ps' so that

set psfile = project;

will produce a file called project.ps

## Format Control

The pscontrol command can be used to alter the style of the graphical output, for instance to change the size of text or the number of graphs per page. If required this should be done after the psfile is selected, and before the program; command, thus:

set psfile = filename;
.
pscontrol( options );
.
program;


The available options are:

| parameter | description |
|-----------|-------------|
| clipping | method of line clipping |
| displayfile | show the name of the postscript file |
| displaygas | show version of gas program used |
| displaypage | show page number |
| displaytime | show time of creating file |
| fontname | set name of text font |
| fontscale | set scaling factor for text font |
| grouping | whether to group related graphs |
| layoutorigin | set x,y origin of graphs |
| layoutsize | set size of page for graphs |
| layoutxdata | set range of x-axis |
| layoutxpage | set x-location of graphs |
| layoutydata | set range of y-axis |
| layoutypage | set y-location of graphs |
| showpoints | plot symbols on line graphs |
| smoothfactor | interpolation points for smoothing |
| smoothing | type of curve smoothing to use |
| symbolpoints | type of symbols to use |

## General Parameters

### Clipping

When drawing curves it is possible that the line will go outside of the current graph, and the value of clipping determines what to do if this happens.

| value | result |
|-------|--------|
| clipping=0 | ignore places where line is off graph |
| 0 < clipping < 1 | colour axes grey where line is off graph |
| clipping=1 | draw line even if it 'escapes' graph |

The default value is 0.5 which makes the axes medium grey if the line goes off the graph (the higher the value of clipping, the darker the out-of-range sections will be shaded).

### Grouping

To reduce paper usage gas will group some types of related graphs on the same page. This feature may be turned off using the parameter

grouping = n

in which case all the graphs will be expanded to full size and drawn on separate pages.

### Showpoints and Symbolpoints

Curves in gas are drawn from a set of points at which the precise function values are known, and the showpoints parameter will cause gas to mark these points with a symbol. The symbolpoints parameter allows the user to select the symbol to use (a value of n gives an n-sided polygon, and -n gives an n-sided star). Hence

showpoints=y symbolpoints=5

will draw pentagons at the points gas is using to plot curves (use symbolpoints=-5 to draw five-pointed stars instead).

### Smoothing and Smoothfactor

All graphics routines draw curves by joining up straight line segments. The default mode in gas (which is recommended as being the most 'honest') is to draw single straight lines between points at which function values are known (smoothing=0). However cubic spline interpolation can be used to achieve a more 'rounded' effect by interpolating extra points between the known values. To use this select

smoothing = 1

The value of smoothfactor determines the number of intermediate points that the smoothing algorithm will create (default is 5), and higher values give smoother curves, though the size of the postscript file will be increased.

Beware: it is possible that the interpolation process may produce curves which 'overshoot'. If you are going to use this option, then you must check (using showpoints) that this does not happen with your dataset.

## Displaying Page Information

The display options determine how much identification information is printed on each page. They take the values y and n, with the default being to show all the information.

| parameter | description |
|-----------|-------------|
| displayfile | Controls whether the name of the psfile is shown. |
| displaygas | Controls whether the gas identification logo is shown. |
| displaypage | Controls whether the page number is shown. |
| displaytime | Controls whether the time that the file was opened is printed on each page. |

Thus to prevent the file name being displayed, include the parameter

displayfile = n

## Font Control

The font* variables allow the font used for text to be modified.

### Font Name

The fontname parameter controls which Postscript font is used to label graphs. The available fonts (which will work on most Postscript printers) are:
• tr = Times Roman

Hence to select Times Roman (which is the default) enter

fontname = tr

### Font Scaling

The fontscale parameter allows the size of the text to be magnified by a factor between 0.01 and 100. Thus

fontscale = 0.2

will shrink all text by a factor of 5.

## Layout Control

The layout variables control the position and spacing of graphs on the page. Using these parameters gives fairly low level control of the internal gas plotting routines, and it's thus possible to produce some horribly messy diagrams by setting inappropriate values.

### Layout Size

The layoutsize variable controls the fraction of the page that will be drawn on, so that

layoutsize = 0.5

will use half of the page (centred on the available area).

### Layout of Page

The layout*page variables control the fraction of the draw-able page that is used for the graphs (all parameters must thus be between 0 and 1).

layoutxpage = 0.1 0.4

layoutypage = 0.0 0.8

### Layout of Data

The layout*data variables determine the range of the axes used for plotting.

layoutxdata = 3.5 1e+3

layoutydata = -1 4.0

### Layout of Origin

The layoutorigin variable controls where the origin of the axes is to be located.

layoutorigin = -2.0 5.3

# Creating Subsets

There is often a need to sub-divide the members within a pedigree according to phenotypic or relational criteria. The select and delete functions provide several options for automatically categorizing subjects and creating new pedigrees from subsets of the original.

The syntax and options for select and delete are identical, and discussed after the individual routine specifications.

Apology: the select and delete functions are very powerful, and as a result the format required to control them may at first sight seem confusing. However it is strictly logical (there are no 'special' cases to learn) and a large number of examples are provided in the text to enable users to modify them as required.

## Select

New datasets may be constructed from old ones by creating subsets using the select routine. This examines all the individuals in the pedigree according to user-specified criteria and removes those subjects who do not satisfy them. It has the syntax:

select( criteria... then selectee(s)... optional... );

The criteria is composed of instructions asking gas to count some values relating to a subject and to perform numeric tests on them. If the overall outcome of the tests is true then the selectee(s)... listed after then are marked for selection. After the whole pedigree has been tested, all of the subjects not marked for selection are removed.

If then... is not included, the routine assumes that just the subject being tested is to be affected by the result (ie. it is equivalent to entering then subject). The optional parameter makegood will cause select to also remove families which no longer consist of a single group.

## Delete

New datasets may be constructed from old ones by creating subsets using the delete routine. This examines all the individuals in the pedigree according to user-specified criteria and removes those subjects which satisfy them. It has the syntax:

delete( criteria... then deletee(s)... optional... );

The criteria... is composed of instructions asking gas to count some values relating to a subject and to perform numeric tests on them. If the overall outcome of the tests is true then the deletee(s)... listed after then are marked for deletion. After the whole pedigree has been tested, all of the subjects not marked for deletion are retained.

If then... is not included, the routine assumes that just the subject being tested is to be affected by the result (ie. it is equivalent to entering then subject). The optional parameter makegood will cause delete to also remove families which no longer consist of a single group.
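The difference between select and delete is only in what happens to the marked subjects, as the following Python sketch shows. It models a pedigree as a plain list of subject names and the criteria as a function returning True or False; these representations, and the function names, are hypothetical and chosen only for illustration. The makegood option is not modelled.

```python
def _mark(subjects, criterion, then):
    """Test every subject and collect those named by the 'then' part."""
    marked = set()
    for subject in subjects:
        if criterion(subject):
            marked.update(then(subject))
    return marked


def select(subjects, criterion, then=lambda subject: [subject]):
    """Keep only the marked subjects (default: 'then subject')."""
    marked = _mark(subjects, criterion, then)
    return [s for s in subjects if s in marked]


def delete(subjects, criterion, then=lambda subject: [subject]):
    """Remove the marked subjects and retain everyone else."""
    marked = _mark(subjects, criterion, then)
    return [s for s in subjects if s not in marked]
```

For example, if only kid satisfies the criteria and the then part names the parents, select returns the two parents alone, whereas delete with the same arguments returns everyone except the parents.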

## Criteria

The first part of both select and delete is the criteria, an expression whose True/False value is used as the basis for choosing subjects. Criteria may be simple (composed of only one condition) or compound (made up of several simple criteria tied together using 'and' and 'or').

### Simple Criteria

There are two main choices for simple criteria:
• locus locus_name
• relatives

The test part of each criterion is described using one of the operators >, =, <, <=, >= or !=, followed by a number. For instance, to select all individuals who have quantitative locus atopy greater than 3.5 enter

select ( locus atopy subject a > 3.5 );

## Loci

Selection can be made according to the locus values in categories including brother, child, father, mother, sibling, sister, and subject. A full list of the countable relatives can be obtained using the program's help feature (gas help).

### Affection Loci

With affection loci the value is either y, n, x, or liab class_name according to whether we wish to select individuals who are affected, unaffected, unknown-status, or belong to a particular liability class.

To select all individuals who have at least one affected child

select ( locus locus_name child y >= 1 );

To discard all individuals who have unknown status:

delete ( locus locus_name subject x > 0 );

To select all individuals who belong to liability class 2

select ( locus locus_name subject liab 2 > 0 );

To select all individuals who have non-affected mothers

select ( locus locus_name mother n > 0 );

To delete all individuals who do not have a father in liability class bald

delete ( locus locus_name father liab bald < 1 );

#### * Example *

The gasfile sela.gas reads locus data from sel.loc and pedigree data from sel.ped, then selects all the subjects who have at least one non-affected child, and writes the new pedigree in g-format to the file selaped.new.

#### * Example *

The gasfile sell.gas reads locus data from sel.loc and pedigree data from sel.ped, then selects all the subjects who have a father in liability class blue (these fathers also being selected).

### Binary loci

There are at present no select options available for binary loci.

### Named loci

With named loci the value is either heterozygous, homozygous or allele names (including the 'wildcard' values matcha and matchx). The matcha parameter will match any allele which is not x. The matchx parameter will match any allele including x.

To select all individuals who have genotype '1 3'

select ( locus locus_name subject 1 3 > 0 );

To select all individuals who have a heterozygous father

select ( locus locus_name father heterozygous > 0 );

To delete all individuals who have more than two homozygous children

delete ( locus locus_name child homozygous > 2 );

To select all individuals who have genotype '1 x'

select ( locus locus_name subject 1 x > 0 );

To delete all individuals who have at least one copy of allele 'b'

delete ( locus locus_name subject b matchx > 0 );

To select all individuals who have two known alleles

select ( locus locus_name subject matcha matcha > 0 );

### Quantitative loci

Subjects may be selected or discarded according to the values of quantitative loci. Three values s, a and g may be used for comparison, indicating the smallest, average and greatest values of a set of individuals (for categories with only one individual, such as mother, these have identical effect). To delete all individuals who have values greater than or equal to 6 (by selecting only those with values less than 6):

select ( locus locus_name subject a < 6 );

To select all individuals who have fathers with positive values

select ( locus locus_name father a > 0 );

To delete all individuals who have at least one child with a value less than 20

delete ( locus locus_name child s < 20 );

To select all individuals who have no children with a value greater than 20:

select ( locus locus_name child g <= 20 );

To select all individuals who have children with an average value less than 20:

select ( locus locus_name child a < 20 );
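The meaning of the s, a and g values can be illustrated with a short Python sketch that summarises the quantitative values of a set of relatives. Ignoring unknown values, and returning nothing when no value is known, are assumptions made for this illustration; the manual does not state how gas treats these cases. The function name summarise is hypothetical.

```python
def summarise(values):
    """Return the smallest (s), average (a) and greatest (g) of the known
    values, or None if every value is unknown (represented here by None)."""
    known = [v for v in values if v is not None]
    if not known:
        return None
    return {"s": min(known), "a": sum(known) / len(known), "g": max(known)}
```

A test such as child s < 20 then asks whether the smallest child value is below 20 (at least one child is), while child g < 20 asks whether the greatest is (every child is).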

#### * Example *

The gasfile selq.gas reads pedigree data from the file sel.ped, selects all the subjects who have a value less than 2.5 at locus 'response', and writes the new pedigree in g-format to the file selqped.new.

## Relatives

Individuals may be selected according to the number of relatives they have in their pedigree (see gas help for a full list of the categories of countable relatives).

To select individuals who have a father in the pedigree:

select ( relative father > 0 );

To delete individuals who do not have a mother in the pedigree

delete ( relative mother < 1 );

To select individuals who have at least 3 children

select ( relative child > 2 );

#### * Example *

The gasfile selr.gas reads pedigree data from the file sel.ped, selects all the subjects with at least two daughters, and writes the new pedigree to the file selrped.new.

## Compound Criteria

The basic commands may be combined to produce more complex effects using the 'and' (&&) and 'or' (||) operators together with appropriate bracketing. Also the expressions used for simple criteria may be compared against each other.

To select families having discordant sib-pairs (ie. some children affected and some normal) use the following:

select ( locus locus_name child y > 0 && locus locus_name child n > 0);

To delete individuals who have more affected sons than daughters

delete ( locus locus_name son y > locus locus_name daughter y );

To select individuals who have a normal father and affected mother, or vice versa:

   select
( ( locus locus_name father n > 0 && locus locus_name mother y > 0 )
||
( locus locus_name father y > 0 && locus locus_name mother n > 0 ) );


To select individuals for which the sum of two quantitative loci is less than 7.5

select ( locus locus_name_1 subject a + locus locus_name_2 subject a < 7.5 );

To select individuals who are affected at a particular locus and whose children all have a quantitative value greater than 2.5 at a second locus:

select ( locus locus_name_1 subject y > 0 && locus locus_name_2 child s > 2.5 );

To select individuals who are affected at exactly two out of three loci:

select( locus locus_name_1 subject y + locus locus_name_2 subject y + locus locus_name_3 subject y = 2 );

#### * Example *

The gasfile selc1.gas reads pedigree data from the file sel.ped, selects all the subjects affected at locus 'disease' who have at least one son. The new pedigree is written in g-format to the file selc1ped.new.

#### * Example *

The gasfile selc2.gas reads pedigree data from the file sel.ped, selects all the subjects affected at locus 'disease' who have a value either less than 1.5 or greater than 7.0 at locus 'response'. The new pedigree is written in g-format to the file selc2ped.new.

## Use of 'then'

The then statement allows relatives of the person being tested to be selected or deleted. The list of relatives is the same as that for select.

To select/delete only the subject (this is also done if you omit the then part entirely)

... then subject

To select/delete both the subject's parents (note that in this case the subject itself will not be acted upon)

... then father mother

To select/delete the subject and its children

... then subject child

To select/delete the whole family of the subject

... then family

# Editing Data Descriptions

gas provides facilities for altering the structure and appearance of data. These are grouped under the edit command, which has the format

edit( parameters... );

The main parameters used with the edit command are described below.

## Lformat

The lformat option changes all alphabetic names in the loci and pedigree specification to unique numbers, and writes a list of translations to the logfile. This allows data to be stored in full g-format (i.e. using real names for families, diseases and loci) and quickly converted into pure numeric format for running programs such as FASTLINK. The syntax is

edit( lformat );
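The idea behind the translation list can be sketched in a few lines of Python. This is an illustration only: the function name `number_names` is hypothetical, and the numbering order (first appearance, starting at 1) is an assumption, since the order gas actually uses is not stated here.

```python
def number_names(names, start=1):
    """Give each distinct name a unique number, in order of first appearance.

    Assumption: numbering starts at `start` and follows first appearance;
    the scheme used by gas itself may differ.
    """
    table = {}
    for name in names:
        if name not in table:
            table[name] = start + len(table)
    return table


# Build the translation table, then rewrite the data using it.
families = ["smith", "jones", "smith", "patel"]
translation = number_names(families)
numeric_families = [translation[name] for name in families]
```

Keeping the table alongside the converted data is what allows the numeric output to be traced back to the original names.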

## Twogen

The twogen option re-structures a pedigree into a series of two-generation families. All extended families are split up into a number of nuclear families and given new names. The syntax is

edit( twogen );
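As a rough picture of what splitting into nuclear families involves, the Python sketch below groups children by their pair of parents. It is not the gas implementation: the function name `split_nuclear`, the data layout and the new family names (`fam_1`, `fam_2`, ...) are hypothetical, and subjects with only one recorded parent are simply skipped.

```python
from collections import defaultdict


def split_nuclear(family, members):
    """Split one extended family into two-generation (nuclear) families.

    members maps subject id -> (father, mother); founders are (None, None).
    Returns new_family_name -> {"father", "mother", "children"}.
    Assumption: new names are the old name plus a running number.
    """
    sibships = defaultdict(list)
    for person, (father, mother) in members.items():
        if father is not None and mother is not None:
            sibships[(father, mother)].append(person)

    nuclear = {}
    for index, ((father, mother), children) in enumerate(sorted(sibships.items()), start=1):
        nuclear[f"{family}_{index}"] = {
            "father": father,
            "mother": mother,
            "children": sorted(children),
        }
    return nuclear


three_generations = {
    "gf": (None, None),
    "gm": (None, None),
    "dad": ("gf", "gm"),
    "mum": (None, None),
    "kid1": ("dad", "mum"),
    "kid2": ("dad", "mum"),
}
result = split_nuclear("fam", three_generations)
```

In this sketch a subject such as `dad` appears twice: once as a child in his parents' family and once as a parent in his own.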

## Locus

The locus parameter, followed by the name of a locus, has several options:

### Nonparam

The command

edit( locus qv nonparam );

will replace the values of quantitative locus qv in each subject by their non-parametric rank in the current pedigree.
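A rank transformation of this kind can be sketched in Python as follows. The sketch makes two assumptions that are not stated above: tied values share the average of their ranks, and unknown values (shown here as `None`) stay unknown. The function name `rank_transform` is hypothetical.

```python
def rank_transform(values):
    """Replace each known value by its rank (1 = smallest).

    Assumptions: ties receive the average of the ranks they span,
    and unknown values (None) are left unknown.
    """
    known = sorted(v for v in values if v is not None)
    ranks = {}
    i = 0
    while i < len(known):
        j = i
        while j < len(known) and known[j] == known[i]:
            j += 1
        # Sorted positions i..j-1 hold equal values; their 1-based ranks
        # are i+1..j, whose average is (i + 1 + j) / 2.
        ranks[known[i]] = (i + 1 + j) / 2
        i = j
    return [None if v is None else ranks[v] for v in values]


ranked = rank_transform([3.2, None, 1.0, 3.2, 7.5])
```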

### Delete

The command

edit( locus lc delete );

will delete locus lc from the current dataset. This will reduce the RAM required by gas and may improve performance on very small machines (it also provides an alternative to using the locus parameter with write( pedigree... ); when you only want to write out a subset of the full data).

### Makelocal

The command

edit( locus lc makelocal );

reduces the number of alleles (of the named locus lc) in each pedigree as far as possible, while maintaining uniqueness. This is equivalent to scoring each family locally (see read(alsize);).

### Rename

The command

edit( locus la rename lb );

will rename locus la to lb.

### Updatefreq

The command

edit( locus la updatefreq option );

alters the input allele frequencies for the locus la so that they are equal to those found in the dataset. The optional parameters parent and child set the frequencies to those found in the parental and child populations respectively (see dissect(locus); for a definition of parents and children in this context); otherwise the frequencies found in the whole input dataset are used.
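One simple way to obtain frequencies "found in the dataset" is direct allele counting, sketched below in Python. This is an assumption made for illustration: the estimator gas uses is not described here, and the function name `allele_frequencies` and the genotype layout are hypothetical. Restricting the input to parents or to children would correspond to the parent and child options.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Estimate allele frequencies by counting observed alleles.

    genotypes is an iterable of (allele1, allele2) pairs; None marks an
    unknown allele and is not counted.
    Assumption: plain counting, with no allowance for relatedness.
    """
    counts = Counter(a for pair in genotypes for a in pair if a is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


freqs = allele_frequencies([(1, 2), (1, 1), (None, None), (2, 3)])
```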

### Lumpalleles

The command

edit( locus la lumpalleles old_alleles... into new_allele );

takes the set of alleles listed as old_alleles... and renames them all to the single name new_allele.
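The renaming itself amounts to a simple substitution over every genotype, as the Python sketch below illustrates. The function name `lump_alleles` and the genotype layout are hypothetical, and the sketch deals only with the observed genotypes.

```python
def lump_alleles(genotypes, old_alleles, new_allele):
    """Rename every allele listed in old_alleles to new_allele.

    genotypes is a list of (allele1, allele2) pairs; other alleles,
    including unknowns, are left untouched.
    """
    old = set(old_alleles)
    return [
        tuple(new_allele if allele in old else allele for allele in pair)
        for pair in genotypes
    ]


lumped = lump_alleles([(1, 2), (3, 4), (2, 2)], old_alleles=[2, 3], new_allele=9)
```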

# Making New Loci

gas is able to create new trait loci according to the values of existing loci defined at input. For instance, subjects could be defined as being affected by the new trait atopy if they were affected at 2 or more of the affection loci asthma, eczema and rhinitis.

This is done using the set newlocus command, which has the same format as set locus except for the addition of one or more bracketed expressions which describe how the value at the locus is to be calculated.

set newlocus locus_name type_of_locus
number_of_alleles gene_frequencies...
specifications...;

The bracketed expressions have the format

( item expression_to_calculate );

where the item is the part of the new locus that is to be set, and expression_to_calculate is in the same format as that used by select and delete.

Note that, when calculating a new value, it is important to test that the components are known values, otherwise the results are likely to be incorrect.
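To see why the test for known values matters, consider the atopy example from the start of this chapter. The Python sketch below is one possible way of treating unknowns, offered as an assumption rather than as the behaviour of gas: a subject is called affected or not affected only when the unknown components could not change the answer. The function name `atopy_status` is hypothetical.

```python
def atopy_status(asthma, eczema, rhinitis):
    """Combine three affection statuses ('y', 'n' or 'x') into one.

    Rule from the text: affected if affected at 2 or more of the loci.
    Assumption: the result is unknown ('x') whenever the unknown
    components could still change the outcome.
    """
    statuses = (asthma, eczema, rhinitis)
    affected = sum(s == "y" for s in statuses)
    unknown = sum(s == "x" for s in statuses)
    if affected >= 2:
        return "y"  # already affected whatever the unknowns are
    if affected + unknown < 2:
        return "n"  # cannot reach 2 even if every unknown were affected
    return "x"      # the answer depends on values that are not known
```

Without such a test, a subject with statuses y, x, n would be counted as affected at one locus only and wrongly recorded as not affected.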

## Affection Loci

For affection loci there are 3 choices of item. These are:

| parameter | description |
| --- | --- |
| x | set status as unknown if true |
| y | set status as affected if true |
| n | set status as not-affected if true |

If the ( x ... ) bracket is present and true for a particular subject then the ( y ... ) and ( n ... ) terms are ignored. The ( x ... ) term cannot be used alone.

If both the ( y ... ) and ( n ... ) terms are present, but neither is true, then the value is taken to be unknown. The ( y ... ) term is always evaluated before the ( n ... ) term, and if the former is true, then the latter is ignored.
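The order of evaluation described above can be summarised in a short Python sketch. It is an illustration of the stated rules only: the function name `affection_status` is hypothetical, each argument stands for the truth value of the corresponding bracket (or `None` if that bracket is absent), and the case where only one of the y and n terms is present and false is not covered by the rules, so the sketch refuses to guess.

```python
def affection_status(x=None, y=None, n=None):
    """Return 'x', 'y' or 'n' following the evaluation order in the text.

    Each argument is True/False if the bracket is present, None if absent.
    """
    if y is None and n is None:
        # The ( x ... ) term cannot be used alone.
        raise ValueError("at least one of the y and n terms is required")
    if x:
        return "x"  # x present and true: y and n are ignored
    if y:
        return "y"  # y is evaluated before n; n is ignored if y is true
    if n:
        return "n"
    if y is not None and n is not None:
        return "x"  # both present, neither true: unknown
    # Only one of y / n present and false: not described in the text.
    raise ValueError("case not covered by the rules above")
```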

## Quantitative Loci

For quantitative loci there are 2 choices of item. These are:

| parameter | description |
| --- | --- |
| x | set status as unknown if true |
| q | set value to numeric expression |

If the ( x ... ) bracket is present and true for a particular subject then the ( q ... ) term is ignored. The ( x ... ) term cannot be used alone.

The ( q ... ) term may use any of the mathematical functions built into gas: logarithms, exponentials, +, -, etc. (use gas help for a list of the mathematical functions available to combine and transform the values of loci and relative counts).
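When a function such as a logarithm is applied, the ( x ... ) term is the natural place to guard against arguments for which the function is undefined. The Python sketch below shows the logic only; the function name `log_locus` is hypothetical, unknown values are represented by `None`, and the natural logarithm is assumed.

```python
import math


def log_locus(value):
    """Return log(value), or None (unknown) when it cannot be computed.

    Mirrors an ( x ... ) guard followed by a ( q ... ) calculation.
    Assumption: natural logarithm.
    """
    if value is None or value <= 0:
        return None  # the x test is true, so the q term is ignored
    return math.log(value)


transformed = [log_locus(v) for v in [None, 0, 1.0, math.e]]
```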

#### * Example *

The gasfile makenew.gas reads pedigree data from the file skin.ped, and creates four new loci. Subjects are assumed to be affected at locus spotty if they are affected at either of loci acne or pimples. Subjects are assumed to be affected at locus very_spotty if they have a value greater than 5 at quantitative locus spots_per_cm2, or if they are affected at locus boils. Locus logspotty is set to be the logarithm of the value at locus spots_per_cm2 (note the test to avoid zero arguments). The new locus specifications are written to the file newskin.loc and the modified pedigree is saved in the file newskin.ped.

# Appendix C - Control Variables

Below is a summary of the gas control variables listed elsewhere in this manual. The '*' denotes that there is no default value for a particular item, and that it will not be used unless explicitly set by the user.

| variable | function | values | default |
| --- | --- | --- | --- |
| autoloop | automatically break loops | y/n | y |
| checkloop | check for loops | y/n | y |
| checkrelated | check all of pedigree is related | y/n | y |
| logfile | name of logfile | * | * |
| loopbreak | break loop at location | pedigree subject | * |
| lqunknown | quantitative unknown-code for l-format | any whole number | 0 |
| maxerrors | maximum number of errors | >= 1 | 5 |
| maxwarnings | maximum number of warnings | >= 1 | 5 |
| outfile | name of output file | * | gas.out |
| proband | designate proband for a family | pedigree subject | * |