This is the data used in experiments to predict the functional class of yeast ORFs. These experiments are reported in
The predictions that were made from this data can be found here.
The classes were taken from the MIPS functional catalog on 3/3/04 and GeneOntology on 2/3/04 and are listed as follows:
This data was collected from a variety of sources, including ProtParam and MIPS. The attributes are as follows:
Attribute | Type | Description |
strand | 'w' or 'c' | The DNA strand on which the gene lies |
chromo | 1,2,3,4,5 | The chromosome on which the gene lies |
startpos | integer | Start position of seq |
endpos | integer | End position of seq |
numpos | integer | Number of exons |
numaa | integer | Number of amino acids |
mol_wt | integer | Molecular weight of the protein |
theo_pI | float | Theoretical pI (isoelectric point) |
percentneg | float | Percent of negatively charged residues |
percentpos | float | Percent of positively charged residues |
carbon | float | Atomic composition of carbon |
hydrogen | float | Atomic composition of hydrogen |
nitrogen | float | Atomic composition of nitrogen |
oxygen | float | Atomic composition of oxygen |
sulphur | float | Atomic composition of sulphur |
aliphatic | float | The aliphatic index |
instability | float | The instability index |
gravy | float | Grand average of hydropathicity |
X_ratio | float | Percentage of amino acid X in the protein |
seq_len | integer | Length of the protein sequence |
XYN_ratio | float | Percentage of the pair of amino acids X and Y separated by N-1 amino acids in the protein. That is, XY1 is X and Y adjacent. XY2 is X and Y separated by 1 other amino acid. |
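The X_ratio and XYN_ratio attributes can be illustrated with a minimal sketch. The function names `residue_ratio` and `pair_ratio` are hypothetical. The denominator used for the pair percentage (the number of positions at which such a pair could start) is an assumption, since the description above does not state it.

```python
def residue_ratio(seq, x):
    """Percentage of residues in seq that are amino acid x (X_ratio)."""
    if not seq:
        return 0.0
    return 100.0 * seq.count(x) / len(seq)


def pair_ratio(seq, x, y, n):
    """Percentage of positions i where seq[i] == x and seq[i + n] == y.

    n = 1 means x and y are adjacent; n = 2 means one residue lies
    between them, matching the XYN description above.
    Assumption: the percentage is taken over the len(seq) - n positions
    at which such a pair could start.
    """
    windows = len(seq) - n
    if n < 1 or windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if seq[i] == x and seq[i + n] == y)
    return 100.0 * hits / windows


if __name__ == "__main__":
    toy = "ACAC"  # made-up sequence, not taken from the data set
    print(residue_ratio(toy, "A"))
    print(pair_ratio(toy, "A", "C", 1))
    print(pair_ratio(toy, "A", "A", 2))
```

With this convention, an attribute such as AC1_ratio would correspond to `pair_ratio(seq, "A", "C", 1)`.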
The data is available here in C4.5 format. The .names files give the attribute names. The .train, .valid and .propertest files hold the training, validation and test data respectively. The .unknown files hold data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins), 98,0,0,0 (classification not yet clear cut) or GO:0005554 (molecular_function unknown), or that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classifications (1 is the most general, 4 is the most specific).
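As a rough guide to reading such files, here is a minimal sketch of a parser for the general C4.5 layout: a .names file that lists the classes first and then one `name: type.` entry per attribute, and data files with one comma-separated record per line, class last. The sample text, class names and function names below are made up for illustration, and the actual files may differ in detail (for example in how class labels are written).

```python
def parse_names(text):
    """Parse C4.5 .names text into (classes, [(attribute, spec), ...])."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()  # '|' starts a comment in C4.5
        if line:
            entries.append(line.rstrip("."))
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = []
    for entry in entries[1:]:
        name, _, spec = entry.partition(":")
        attributes.append((name.strip(), spec.strip()))
    return classes, attributes


def parse_rows(text, attributes):
    """Parse C4.5 data text into a list of (record dict, class label)."""
    rows = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()
        if not line:
            continue
        *values, label = [field.strip() for field in line.split(",")]
        record = {}
        for (name, spec), value in zip(attributes, values):
            if value == "?":            # '?' marks a missing value
                record[name] = None
            elif spec == "continuous":
                record[name] = float(value)
            else:
                record[name] = value
        rows.append((record, label))
    return rows


# Hypothetical file contents, for illustration only.
SAMPLE_NAMES = """| toy example
classA, classB.

strand: w, c.
numaa: continuous.
gravy: continuous.
"""
SAMPLE_DATA = """w, 431, -0.25, classA
c, ?, 0.10, classB
"""

if __name__ == "__main__":
    classes, attributes = parse_names(SAMPLE_NAMES)
    for record, label in parse_rows(SAMPLE_DATA, attributes):
        print(label, record)
```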
The files can be up to 59M each in size.
Secondary structure is predicted by Prof. Associations are constructed from the following set of predicates:
Predicate | Description |
ss(Orf, Num, Type, FollowingNum) | This Orf has a secondary structure prediction of type Type (alpha, beta or coil) at relative position Num (position FollowingNum is the next position after this). For example, ss(at1g01010,3,a,4) would mean that the third prediction made for at1g01010 was alpha. |
a_len(Num, AlphaLen) | The alpha prediction at position number Num was of length AlphaLen |
b_len(Num, BetaLen) | The beta prediction at position number Num was of length BetaLen |
c_len(Num, CoilLen) | The coil prediction at position number Num was of length CoilLen |
alpha_dist(Orf, Percent) | The percentage of alphas for this ORF is Percent |
beta_dist(Orf, Percent) | The percentage of betas for this ORF is Percent |
coil_dist(Orf, Percent) | The percentage of coils for this ORF is Percent |
notequal(X,Y) | Variables X and Y should not unify |
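To make the predicates concrete, the sketch below derives ss, a_len, b_len, c_len and the *_dist facts from a per-residue prediction string. Several things here are assumptions rather than the documented pipeline: that the prediction is available as a string over the letters a, b and c, that Num counts runs of identical predictions, that FollowingNum is simply Num + 1, and that the percentages are computed per residue. The function name `structure_facts` and the prediction string are made up.

```python
from itertools import groupby


def structure_facts(orf, prediction):
    """Turn a per-residue prediction string (letters a, b, c) into facts.

    Each run of identical letters becomes one ss fact plus one length fact.
    """
    facts = []
    runs = [(kind, len(list(group))) for kind, group in groupby(prediction)]
    for num, (kind, length) in enumerate(runs, start=1):
        facts.append(("ss", orf, num, kind, num + 1))
        facts.append((kind + "_len", num, length))
    total = len(prediction)
    for kind, name in (("a", "alpha_dist"), ("b", "beta_dist"), ("c", "coil_dist")):
        percent = 100.0 * prediction.count(kind) / total if total else 0.0
        facts.append((name, orf, percent))
    return facts


if __name__ == "__main__":
    # The prediction string is invented; only the ORF name comes from the text.
    for fact in structure_facts("at1g01010", "ccaaaabbcc"):
        print(fact)
```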
Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not? The list of associations and their corresponding attribute numbers is strucAllnondup.nums.
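One way to picture this conversion is as a small pattern-matching exercise: an association is a conjunction of predicates with variables, and the boolean attribute is true when some assignment of the variables satisfies every predicate against the ORF's facts. The sketch below is an illustration under that assumption, not the method actually used to build the files. Facts are represented as tuples, variables as capitalised strings, notequal is not handled, and all names are hypothetical.

```python
def is_var(term):
    """Treat capitalised strings in a pattern as variables."""
    return isinstance(term, str) and term[:1].isupper()


def match(pattern, fact, bindings):
    """Return bindings extended so that pattern equals fact, or None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    extended = dict(bindings)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if extended.setdefault(p, f) != f:
                return None  # variable already bound to something else
        elif p != f:
            return None      # constant in the pattern does not match
    return extended


def holds(association, facts, bindings=None):
    """True if every pattern in the association can be matched at once."""
    bindings = {} if bindings is None else bindings
    if not association:
        return True
    first, rest = association[0], association[1:]
    for fact in facts:
        extended = match(first, fact, bindings)
        if extended is not None and holds(rest, facts, extended):
            return True
    return False


# Made-up facts for a single ORF and two made-up associations.
FACTS = [("ss", "orf1", 1, "c", 2), ("ss", "orf1", 2, "a", 3), ("a_len", 2, 4)]
ASSOCIATIONS = [
    [("ss", "Orf", "N", "a", "M"), ("a_len", "N", 4)],  # an alpha run of length 4
    [("ss", "Orf", "N", "b", "M")],                     # any beta run
]

if __name__ == "__main__":
    print([holds(assoc, FACTS) for assoc in ASSOCIATIONS])
```

Each entry of the resulting list plays the role of one boolean attribute for the ORF.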
Boolean (c4.5) data:
The data is available here in C4.5 format. The .names files give the attribute names. The .train, .valid and .propertest files hold the training, validation and test data respectively. The .unknown files hold data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classification (1 is the most general, 4 is the most specific).
Sequence similarity (usually implying homology) is detected by a PSI-BLAST (blastpgp) search against NRDB. Associations are constructed from the following set of predicates:
Predicate | Description |
eval(Orf, SPID, EVal) | Orf is similar to the SwissProt protein with accession SPID, with e-value EVal. |
desc(SPID,X) | SwissProt protein SPID had description word X |
db_ref(SPID,X) | SwissProt protein SPID had a database reference to the X database |
keyword(SPID,X) | SwissProt protein SPID had keyword X |
species(SPID,X) | SwissProt protein SPID belonged to species X |
classification(SPID,X) | SwissProt protein SPID belonged to classification X in the species taxonomy |
sq_len(SPID,X) | SwissProt protein SPID had sequence length X |
mol_wt(SPID,X) | SwissProt protein SPID had molecular weight X |
Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not? The list of associations and their corresponding attribute numbers is homAllnondup.nums.
Boolean (c4.5) data:
The data is available here in C4.5 format. The .names files give the attribute names. The .train, .valid and .propertest files hold the training, validation and test data respectively. The .unknown files hold data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classification (1 is the most general, 4 is the most specific).
SCOP superfamily predictions, as made by the Superfamily server. Attributes are the classes in the SCOP hierarchy. Values are the e-values of a match to that family. Values of 10 are recorded where there is no match.
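The encoding described above, one attribute per SCOP class with 10 standing in for "no match", can be sketched as follows. The class identifiers and hits are invented for illustration. Keeping the smallest e-value when a class is hit more than once is an assumption, not something the description states.

```python
NO_MATCH_EVALUE = 10.0  # value recorded where there is no match


def superfamily_vector(hits, classes):
    """Build one e-value per class from (class, e-value) hits.

    Assumption: if a class is hit several times, the smallest
    (most significant) e-value is kept.
    """
    best = {}
    for scop_class, evalue in hits:
        if scop_class not in best or evalue < best[scop_class]:
            best[scop_class] = evalue
    return [best.get(scop_class, NO_MATCH_EVALUE) for scop_class in classes]


if __name__ == "__main__":
    # Hypothetical class identifiers and hits.
    classes = ["a.1.1", "b.1.1", "c.1.2"]
    hits = [("a.1.1", 1e-20), ("a.1.1", 1e-5), ("c.1.2", 0.003)]
    print(superfamily_vector(hits, classes))
```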
This data was derived using InterProScan.
The mapping from associations to the corresponding attribute numbers is in interprotrain.out.s100.d4.namesmap
Some of the microarray data from NASC. Results of 43 experiments from cds between Dec 2002 and Jan 2004 using signal, detection call and detection P-values.
To come.