LINKAGE file formats



In developing this package of genetic analysis programs I decided not to invent a new set of file formats for input and output, but to use an existing standard format. Although LINKAGE format is somewhat cumbersome and has some redundancy, it is nonetheless a rich format that is established and well documented. See the LINKAGE homepage at http://linkage.rockefeller.edu for a complete specification of the formats.

Moreover, many groups that are analysing genetic data will have programs to generate LINKAGE format files automatically, and the Mega2 genetic data translation program has LINKAGE as one of its standard formats. See the Mega2 homepage at http://watson.hgen.pitt.edu/docs/mega2_html/mega2.html for complete information.

There is a good deal of information in the input files that my programs will ignore, and there are several features that have not yet been implemented. The input routines for the complete package have now been rewritten to check the data more thoroughly and to insert sensible default values for missing or misspecified inputs where appropriate. CheckFormat is a stand-alone program that simply reads in, checks, and outputs LINKAGE data files; it can be used to debug the input.

The following notes give an indication of which features of the LINKAGE format are currently used and implemented, and which are ignored. I will assume that the reader is reasonably familiar with the general format.

  • The format used is the old-fashioned "post-makeped" format, not the more concise "pre-makeped" format. You can translate from pre- to post-makeped format using CheckFormat as follows:
    • % java CheckFormat pre.par pre.ped post.par post.ped -pre

  • Pre-makeped format differs from post-makeped format only in the pedigree input file. For each line of input the pedigree data is specified only as an offspring, father, mother triplet with no fields for the other three relationship pointers. Also there is no proband status indicator. Other than that there is no difference.

    There is a good case for making the slightly simpler pre-makeped format the default, since the omitted relationship pointers are redundant. However, where I work we use post-makeped files more often, so that is the default.


  • Only the affection status, quantitative variables and numbered allele locus formats are currently implemented. The binary factors locus format is not currently implemented.

  • For quantitative variables a missing or unobserved value is coded as 0.0. This is not a particularly good convention, as it does not allow a zero observation to be specified. More explicitly, the program considers an observation with absolute value less than 0.0000001 to indicate a missing value. So if there is a true observed value smaller than this in absolute value, it has to be increased or decreased slightly. This should not affect the computations too drastically, unless of course the trait has mean zero and very small variance, in which case some shifting or rescaling is necessary.

  • If a multivariate quantitative trait has any element unobserved for a particular individual, then the whole observation for that locus for that individual is ignored.

  • The only information actually used from the parameter input file is the number of loci, the locus-by-locus parameter information and the distances between the loci.

  • The order of the loci in the data is assumed to be the order in which they appear in the file. Thus any information on the third line of the input file is ignored. The other information is read and copied but not used.

  • The distances between loci need to be recombination fractions. If a distance greater than 0.5 is read, the program assumes that the distances are specified in centiMorgans and converts them to recombination fractions using the Kosambi transformation. A warning that this is being done is printed. Note that this is not a good way to code your data: you should use recombination fractions rather than rely on an unreliable value-based inference.

  • All programs in this package allow "half observed" genotypes when using numbered allele loci. For example 0 3 would mean that one allele is a 3 but the other is unobserved. This is mostly used to fix up detected genotype errors.

  • Some programs in the package will allow specification of only 1 parent for an individual, while others require that there are either 0 or 2 specified. A warning is printed if this occurs.

  • The first child, father's next child and mother's next child pointers in the pedigree file are ignored and can safely be replaced by zeros, or any other string not including white space.

  • If an affection status locus has only 1 liability class, the phenotype for an individual is specified by a single digit representing the affection status only.
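
Two of the numeric conventions in the notes above, the missing-value threshold for quantitative variables and the centiMorgan inference for distances, can be sketched in a few lines of Java. This is only an illustration and not the package's actual input code: the class and method names are hypothetical, and it assumes that a single distance above 0.5 causes every distance on the line to be treated as centiMorgans. The Kosambi map function used is the standard r = 0.5 * tanh(2d), with d in Morgans.

```java
// Hypothetical sketch; names are illustrative and not taken from the package.
final class LinkageConventions {

    // Threshold quoted in the notes: |x| < 0.0000001 is read as "missing".
    static final double MISSING_EPSILON = 0.0000001;

    // True if a quantitative observation would be treated as unobserved.
    static boolean isMissing(double value) {
        return Math.abs(value) < MISSING_EPSILON;
    }

    // Kosambi map function: r = 0.5 * tanh(2d), with d in Morgans.
    static double kosambi(double centiMorgans) {
        double morgans = centiMorgans / 100.0;
        return 0.5 * Math.tanh(2.0 * morgans);
    }

    // Assumption: if any distance exceeds 0.5, all distances are taken to be
    // centiMorgans and converted; otherwise they are returned unchanged.
    static double[] toRecombinationFractions(double[] distances) {
        boolean looksLikeCentiMorgans = false;
        for (double d : distances) {
            if (d > 0.5) {
                looksLikeCentiMorgans = true;
            }
        }
        double[] result = distances.clone();
        if (looksLikeCentiMorgans) {
            System.err.println("Warning: distances > 0.5 found; "
                    + "assuming centiMorgans and applying Kosambi.");
            for (int i = 0; i < result.length; i++) {
                result[i] = kosambi(distances[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isMissing(0.0));      // coded as missing
        System.out.println(isMissing(0.25));     // a real observation
        for (double r : toRecombinationFractions(new double[] {10.0, 5.0})) {
            System.out.println(r);
        }
    }
}
```

The sketch also shows why the inference is fragile: a map whose distances are all below 0.5 centiMorgans would be passed through as if it were already in recombination fractions.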

Here are example parameter and pedigree files.