Data for the yeast genome (S. cerevisiae)
This is the data used in experiments to predict the functional class of yeast ORFs. These experiments are reported in
- the PhD thesis "Machine learning and data mining for yeast functional genomics", Amanda Clare, UWA, February 2003, pdf
- Clare, A. and King R.D. (2003) Predicting gene function in Saccharomyces cerevisiae. 2nd European Conference on Computational Biology (ECCB '03). (published as a journal supplement in Bioinformatics 19: ii42-ii49).
The predictions that were made from this data can be found at Yeast Preds.
Classes
The classes were taken from the MIPS functional catalog on 24/4/02, and are listed in the file classes.txt. The actual functional assignments we used are in the file yeast_list_full.24.4.02.pl.
Sequence
This data was collected from a variety of sources, including ProtParam and MIPS. The attributes are as follows:
Attribute | Type | Description |
aa_rat_X | real | Percentage of amino acid X in the protein |
seq_len | integer | Length of the protein sequence |
aa_rat_pair_X_Y | real | Percentage of the pair of amino acids X and Y consecutively in the protein |
mol_wt | integer | Molecular weight of the protein |
theo_pI | real | Theoretical pI (isoelectric point) |
atomic_comp_X | real | Atomic composition of X where X is c (carbon), o (oxygen), n (nitrogen), s (sulphur) or h (hydrogen) |
aliphatic_index | real | The aliphatic index |
hydro | real | Grand average of hydropathicity |
strand | 'w' or 'c' | The DNA strand on which the ORF lies |
position | integer | Number of exons (the number of start positions in its coordinates list) |
cai | real | Codon adaptation index: calculated according to Sharp and Li (Sharp1987) |
motifs | integer | Number of motifs: according to PROSITE dictionary release 13 of Nov. 1995 (Bairoch1996) |
transmembraneSpans | integer | Number of transmembrane spans: calculation follows Klein et al. (Klein1985) using the ALOM program. P:I threshold value of 0.1 is used for ORF products which have at least only one transmembrane span. P:I threshold value of 0.15 is used for all TM-calculated proteins. (Goffeau1993) |
chromosome | 1..16, mit | Chromosome number for this ORF |
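As an illustration of the composition attributes in the table above, the following is a minimal sketch of how seq_len, aa_rat_X and aa_rat_pair_X_Y could be computed from a protein sequence. It is not the code used to build this data. The function name, the use of upper-case residue letters in attribute names, and the choice of overlapping adjacent pairs as the denominator for pair percentages are all assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def sequence_attributes(seq):
    """Return seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence.

    Assumption: pair percentages are taken over all overlapping adjacent
    pairs (len(seq) - 1 of them); the original data may use another base.
    """
    seq = seq.upper()
    n = len(seq)
    attrs = {"seq_len": n}

    # Percentage of each single amino acid.
    singles = Counter(seq)
    for aa in AMINO_ACIDS:
        attrs[f"aa_rat_{aa}"] = 100.0 * singles[aa] / n if n else 0.0

    # Percentage of each ordered pair of consecutive amino acids.
    n_pairs = max(n - 1, 0)
    pairs = Counter(seq[i:i + 2] for i in range(n_pairs))
    for x in AMINO_ACIDS:
        for y in AMINO_ACIDS:
            key = f"aa_rat_pair_{x}_{y}"
            attrs[key] = 100.0 * pairs[x + y] / n_pairs if n_pairs else 0.0
    return attrs

# Hypothetical four-residue sequence, for illustration only.
example = sequence_attributes("MKKA")
```

With this layout each protein yields 1 length attribute, 20 single-residue percentages and 400 pair percentages.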
The data is available here in C4.5 format. .names files give attribute names. .train, .valid and .propertest are training, validation and propertest data respectively. .unknown is data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or those that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS functional classification (1 is the most general, 4 is most specific), and 0 represents function classification at all appropriate levels.
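For readers unfamiliar with the C4.5 format, here is a rough sketch of reading a .names description and one data row. It assumes the common layout: a first entry listing the classes, then one "name: type." entry per line, with "|" starting a comment and "?" marking a missing value. The sample text is hypothetical and is not taken from the files on this page, and the sketch assumes class labels that contain no commas.

```python
def parse_names(text):
    """Parse C4.5 .names text into (classes, attributes).

    Assumes one entry per line; attributes are (name, "continuous")
    or (name, [allowed values]).
    """
    classes = None
    attributes = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()  # drop comments after "|"
        if not line:
            continue
        line = line.rstrip(".").strip()      # entries end with a period
        if classes is None:
            classes = [c.strip() for c in line.split(",")]
            continue
        name, _, spec = line.partition(":")
        spec = spec.strip()
        if spec == "continuous":
            attributes.append((name.strip(), "continuous"))
        else:
            attributes.append((name.strip(), [v.strip() for v in spec.split(",")]))
    return classes, attributes

def parse_row(line, attributes):
    """Parse one comma-separated data row; the class label comes last."""
    *values, label = [f.strip() for f in line.split(",")]
    record = {}
    for (name, kind), value in zip(attributes, values):
        if value == "?":
            record[name] = None              # missing value
        elif kind == "continuous":
            record[name] = float(value)
        else:
            record[name] = value
    return record, label

# Hypothetical .names content, for illustration only.
SAMPLE_NAMES = """\
| hypothetical example, not a file from this page
classA, classB.

seq_len: continuous.
strand: w, c.
theo_pI: continuous.
"""

classes, attributes = parse_names(SAMPLE_NAMES)
record, label = parse_row("120, w, ?, classA", attributes)
```

The same two functions would apply to the .train, .valid, .propertest and .unknown rows, provided the files follow the layout assumed here.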
The files can be up to 1.5M each in size.
Phenotype
See also this page about learning with phenotype data.
Original sources of data:
- TRIPLES raw data and processed data
- EUROFAN
- MIPS
Reformatted for C4.5 (large files are bzip2-ed):
- level 1: known - unknown - names
- level 2: known - unknown - names
- level 3: known - unknown - names
- level 4: known - unknown - names
Description of attributes (growth media).
Expression
Download data as a single gzipped tar file. See also this page about results from clustering expression data.
Original data sources:
Homology
The patterns discovered by PolyFARM and used as boolean attributes hompatterns.gz (378K)
The list of dbref terms used dbrefnames
The species hierarchy/taxonomy specieshierarchy
The homology data in a relational format (broken into pieces for easier download/maintenance) - each piece is about 7-10M in size as a bzipped file.
Associations are constructed from the following set of predicates:
Fact | Description |
eval(Orf, SPId, EVal) | The e-value of the similarity between the ORF and the SWISSPROT protein |
yeast_to_yeast(Orf, Orf, EVal) | The e-value between this ORF and another ORF in the yeast genome. |
sq_len(SPId, Len) | The sequence length of the SWISSPROT protein |
mol_wt(SPId, MWt) | The molecular weight of the SWISSPROT protein |
classification(SPId, Classfn) | The classification of the organism the SWISSPROT protein belonged to. This is part of a hierarchical species taxonomy. The top level of the hierarchy contains classes such as "bacteria" and "viruses" and the lower levels contain specific organisms such as "escherichia" and "saccharomyces". |
keyword(SPId, KWord) | Any keywords listed for the SWISSPROT protein. Only keywords which could be directly ascertained from sequence were used. These were the following: transmembrane, inner_membrane, plasmid, repeat, outer_membrane, membrane. |
db_ref(SPId, DBName) | The names of any databases that the SWISSPROT protein had references to. For example: PROSITE, EMBL, FlyBase, PDB. |
Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not?
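The conversion to boolean attributes can be sketched as follows. This is only an illustration of the idea, not PolyFARM or the actual patterns in hompatterns.gz. The facts, the pattern names and the e-value cut-off of 1e-8 are all hypothetical.

```python
# Hypothetical facts in the spirit of the predicates above.
EVAL = {
    ("yal001c", "p1", 1e-20),
    ("yal001c", "p2", 0.5),
    ("yal002w", "p2", 1e-9),
}
CLASSIFICATION = {("p1", "bacteria"), ("p2", "viruses")}
DB_REF = {("p1", "PROSITE"), ("p2", "EMBL")}

def similar_to(orf, max_eval):
    """SWISSPROT ids whose e-value against this ORF is at most max_eval."""
    return {sp for (o, sp, e) in EVAL if o == orf and e <= max_eval}

def has_hit_in(orf, classfn, max_eval=1e-8):
    """Does the ORF have a similar protein from this taxonomic class?"""
    return any((sp, classfn) in CLASSIFICATION for sp in similar_to(orf, max_eval))

def has_hit_with_dbref(orf, db, max_eval=1e-8):
    """Does the ORF have a similar protein that references this database?"""
    return any((sp, db) in DB_REF for sp in similar_to(orf, max_eval))

# Each association becomes one named yes/no test on an ORF.
PATTERNS = [
    ("hit_bacteria", lambda orf: has_hit_in(orf, "bacteria")),
    ("hit_viruses", lambda orf: has_hit_in(orf, "viruses")),
    ("hit_prosite", lambda orf: has_hit_with_dbref(orf, "PROSITE")),
]

def boolean_row(orf):
    """One boolean attribute per association for the given ORF."""
    return {name: test(orf) for name, test in PATTERNS}

rows = {orf: boolean_row(orf) for orf in ("yal001c", "yal002w")}
```

Each entry of rows then corresponds to one line of a boolean C4.5 data file, with one column per association.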
Boolean (c4.5) data:
The data is available here in C4.5 format. .names files give attribute names. .train, .valid and .propertest are training, validation and propertest data respectively. .unknown is data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or those that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS functional classification (1 is the most general, 4 is most specific), and 0 represents function classification at all appropriate levels.
Predicted Secondary Structure
The structure data in a relational format struct.discretised.kb.gz (2.6M)
The patterns discovered by PolyFARM and used as boolean attributes structpatterns.gz (133K)
Secondary structure is predicted by the program Prof. Associations are constructed from the following set of predicates:
Predicate | Description |
ss(Orf, Num, Type) | This Orf has a secondary structure prediction of type Type (alpha, beta or coil) at relative position Num. For example, ss(yal001c,3,alpha) would mean that the third prediction made for yal001c was alpha. |
alpha_len(Num, AlphaLen) | The alpha prediction at position number Num was of length AlphaLen |
beta_len(Num, BetaLen) | The beta prediction at position number Num was of length BetaLen |
coil_len(Num, CoilLen) | The coil prediction at position number Num was of length CoilLen |
alpha_dist(Orf, Percent) | The percentage of alphas for this ORF is Percent |
beta_dist(Orf, Percent) | The percentage of betas for this ORF is Percent |
coil_dist(Orf, Percent) | The percentage of coils for this ORF is Percent |
nss(Num1, Num2, Type) | The prediction at position Num2 is of type Type (we used Num2 = Num1+1, i.e. Num1 and Num2 are neighbouring positions) |
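To make the predicates above concrete, here is a small sketch that derives them from a per-residue prediction string. The input encoding (a for alpha, b for beta, c for coil), the function name, and the reading of the _dist predicates as a percentage of residues are assumptions. The values in struct.discretised.kb.gz are discretised, which this sketch does not attempt.

```python
from itertools import groupby

TYPES = {"a": "alpha", "b": "beta", "c": "coil"}  # assumed input encoding

def structure_facts(orf, prediction):
    """Turn a per-residue prediction string into relational facts."""
    # Collapse runs of identical labels into (type, length) segments.
    segments = [(TYPES[k], len(list(g))) for k, g in groupby(prediction)]

    facts = []
    for num, (kind, length) in enumerate(segments, start=1):
        facts.append(("ss", orf, num, kind))           # ss(Orf, Num, Type)
        facts.append((f"{kind}_len", num, length))     # e.g. alpha_len(Num, Len)
        if num > 1:
            facts.append(("nss", num - 1, num, kind))  # neighbouring segments

    # Assumption: the percentage is taken over residues, not segments.
    total = len(prediction)
    for kind in ("alpha", "beta", "coil"):
        covered = sum(length for k, length in segments if k == kind)
        percent = 100.0 * covered / total if total else 0.0
        facts.append((f"{kind}_dist", orf, percent))
    return facts

# Hypothetical ten-residue prediction, for illustration only.
facts = structure_facts("yal001c", "cccaaaabbc")
```

Here the second segment is an alpha run, so the output contains ("ss", "yal001c", 2, "alpha") together with ("alpha_len", 2, 4).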
Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not?
Boolean (c4.5) data:
The data is available here in C4.5 format. .names files give attribute names. .train, .valid and .propertest are training, validation and propertest data respectively. .unknown is data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or those that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS functional classification (1 is the most general, 4 is most specific), and 0 represents function classification at all appropriate levels.
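The split into known and unknown ORFs at a given level can be sketched as below. This is one possible reading of the paragraph above, not the script used to prepare the files. Representing a class as a 4-tuple of integers, allowing several classes per ORF, and the function names are assumptions, and level 0 (all appropriate levels) is not handled.

```python
UNCLASSIFIED_TOP_LEVELS = {98, 99}  # 98,0,0,0 and 99,0,0,0 in the text above

def class_at_level(mips_class, level):
    """Truncate a 4-part class to the given level (1-4), zero-padded.

    Returns None for unclassified entries and for classes that list
    no function at this level of the hierarchy.
    """
    if mips_class[0] in UNCLASSIFIED_TOP_LEVELS or mips_class[level - 1] == 0:
        return None
    return tuple(mips_class[:level]) + (0,) * (4 - level)

def split_for_level(assignments, level):
    """Split ORFs into known (with labels at this level) and unknown."""
    known, unknown = {}, []
    for orf, classes in assignments.items():
        labels = {class_at_level(c, level) for c in classes} - {None}
        if labels:
            known[orf] = sorted(labels)
        else:
            unknown.append(orf)
    return known, unknown

# Hypothetical assignments, for illustration only.
ASSIGNMENTS = {
    "orf_a": [(1, 5, 1, 0)],
    "orf_b": [(99, 0, 0, 0)],
    "orf_c": [(6, 0, 0, 0)],
}

known1, unknown1 = split_for_level(ASSIGNMENTS, 1)
known2, unknown2 = split_for_level(ASSIGNMENTS, 2)
```

In this example orf_c is known at level 1 but falls into the unknown set at level 2, because no function is listed for it at that level.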
Predictions
The predictions that were made from this data can be found here