Arabdata: Department of Computer Science, Aberystwyth University

This is the data used in experiments to predict the functional class of Arabidopsis thaliana ORFs. These experiments are reported in

Clare, A., Karwath, A., Ougham, H. and King, R. D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136.

The predictions that were made from this data can be found here.

Classes

The classes were taken from the MIPS functional catalog on 3/3/04 and GeneOntology on 2/3/04 and are listed as follows:

Sequence

This data was collected from a variety of sources, including ProtParam and MIPS. The attributes are as follows:

Attribute	Type	Description
strand	'w' or 'c'	The DNA strand on which the gene lies
chromo	1,2,3,4,5	The chromosome on which the gene lies
startpos	integer	Start position of seq
endpos	integer	End position of seq
numpos	integer	Number of exons
numaa	integer	Number of amino acids
mol_wt	integer	Molecular weight of the protein
theo_pI	float	Theoretical pI (isoelectric point)
percentneg	float	Percent of negatively charged residues
percentpos	float	Percent of positively charged residues
carbon	float	Atomic composition of carbon
hydrogen	float	Atomic composition of hydrogen
nitrogen	float	Atomic composition of nitrogen
oxygen	float	Atomic composition of oxygen
sulphur	float	Atomic composition of sulphur
aliphatic	float	The aliphatic index
instability	float	The instability index
gravy	float	Grand average of hydropathicity
X_ratio	float	Percentage of amino acid X in the protein
seq_len	integer	Length of the protein sequence
XYN_ratio	float	Percentage of the pair of amino acids X and Y separated by N-1 amino acids in the protein. That is, XY1 is X and Y adjacent. XY2 is X and Y separated by 1 other amino acid.
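
As an illustration of the X_ratio and XYN_ratio attributes, the following is a minimal sketch of how such percentages could be computed from a protein sequence. The function names and the example sequence are hypothetical, and the choice of denominator for the pair ratio (the number of possible pairs, that is, sequence length minus N) is an assumption; the original pipeline may have normalised differently.

```python
from collections import Counter

def residue_ratios(seq):
    """Percentage of each amino acid in seq (the X_ratio idea)."""
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}

def gapped_pair_ratio(seq, x, y, n):
    """Percentage of positions i where seq[i] == x and seq[i + n] == y.

    n = 1 means x and y are adjacent; n = 2 means one residue lies between.
    Assumption: the denominator is the number of possible pairs, len(seq) - n.
    """
    possible = len(seq) - n
    if possible <= 0:
        return 0.0
    hits = sum(1 for i in range(possible) if seq[i] == x and seq[i + n] == y)
    return 100.0 * hits / possible

seq = "MKKLAK"  # made-up sequence, for illustration only
print(residue_ratios(seq)["K"])            # share of K in the sequence
print(gapped_pair_ratio(seq, "K", "K", 1)) # KK1: K directly followed by K
print(gapped_pair_ratio(seq, "K", "A", 2)) # KA2: K, one other residue, then A
```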

The data is available here in C4.5 format. The .names files give the attribute names. The .train, .valid and .propertest files hold the training, validation and test data respectively. The .unknown files hold data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins), 98,0,0,0 (classification not yet clear cut) or GO:0005554 (molecular_function unknown), or which had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classifications (1 is the most general, 4 is the most specific).

Individual files can be up to 59 MB in size.
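
For readers unfamiliar with C4.5 files, the following is a minimal sketch of a reader for the general layout: a .names file whose first entry lists the classes and whose remaining entries are "name: type" pairs, and data files with one comma-separated case per line ending in the class. It has not been checked against the files on this page, so details such as how class labels are written are assumptions, and all function names are hypothetical.

```python
import gzip

def parse_names(text):
    """Parse the text of a C4.5 .names file into (classes, attributes)."""
    entries = []
    for line in text.splitlines():
        line = line.split("|", 1)[0].strip()  # '|' starts a comment in C4.5 files
        if line:
            entries.append(line.rstrip("."))  # entries end with a period
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = []
    for entry in entries[1:]:
        name, _, kind = entry.partition(":")
        attributes.append((name.strip(), kind.strip()))
    return classes, attributes

def parse_rows(lines, n_attributes):
    """Yield (attribute_values, class_label) for each comma-separated data line."""
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",")]
        if len(fields) > n_attributes:
            # Assumption: everything after the attributes is the class label.
            yield fields[:n_attributes], ",".join(fields[n_attributes:])

def read_gzipped_rows(path, n_attributes):
    """Stream rows from a gzipped data file such as a .train.gz file."""
    with gzip.open(path, "rt") as handle:
        yield from parse_rows(handle, n_attributes)

# Tiny made-up example, not taken from the real files.
classes, attributes = parse_names("c1, c2.\nstrand: w, c.\nnumaa: continuous.\n")
rows = list(parse_rows(["w,120,c1\n", "c,87,c2\n"], len(attributes)))
print(attributes)
print(rows)
```

Streaming the rows rather than loading a whole file at once is a deliberate choice here, given the file sizes involved.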

Predicted Secondary Structure

Secondary structure is predicted by the program Prof. Associations are constructed from the following set of predicates:

Predicate	Description
ss(Orf, Num, Type, FollowingNum)	This Orf has a secondary structure prediction of type Type (alpha, beta or coil) at relative position Num (position FollowingNum is the next position after this). For example, ss(at1g01010,3,a,4) would mean that the third prediction made for at1g01010 was alpha.
a_len(Num, AlphaLen)	The alpha prediction at position number Num was of length AlphaLen
b_len(Num, BetaLen)	The beta prediction at position number Num was of length BetaLen
c_len(Num, CoilLen)	The coil prediction at position number Num was of length CoilLen
alpha_dist(Orf, Percent)	The percentage of alphas for this ORF is Percent
beta_dist(Orf, Percent)	The percentage of betas for this ORF is Percent
coil_dist(Orf, Percent)	The percentage of coils for this ORF is Percent
notequal(X,Y)	Variables X and Y should not unify
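
To make the predicates above concrete, the following is a minimal sketch that derives them from a per-residue prediction string. The string encoding ('a', 'b', 'c' per residue), the function name, the choice to give the last segment a FollowingNum as well, and the choice to compute percentages over residues rather than over segments are all assumptions made for illustration, not a description of how the original facts were generated.

```python
from itertools import groupby

def structure_facts(orf, prediction):
    """Turn a per-residue string such as 'ccbbaaaa' into predicate-style tuples."""
    if not prediction:
        return []
    facts = []
    # Each run of identical symbols is one numbered prediction (segment).
    for num, (kind, run) in enumerate(groupby(prediction), start=1):
        length = len(list(run))
        facts.append(("ss", orf, num, kind, num + 1))
        facts.append((kind + "_len", num, length))
    total = len(prediction)
    for kind, name in (("a", "alpha_dist"), ("b", "beta_dist"), ("c", "coil_dist")):
        # Assumption: percentages are taken over residues.
        facts.append((name, orf, 100.0 * prediction.count(kind) / total))
    return facts

# Made-up prediction whose third segment is alpha, matching the ss example above.
for fact in structure_facts("at1g01010", "ccbbaaaa"):
    print(fact)
```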

Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not? The list of associations and their corresponding attribute numbers is strucAllnondup.nums.
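
The conversion to boolean attributes can be sketched as follows. This is an illustration only: the association labels are opaque placeholders, and the dictionary layout is an assumption rather than the format of strucAllnondup.nums.

```python
def boolean_row(orf_associations, numbered_associations):
    """Return one True/False value per association, ordered by attribute number."""
    return [
        numbered_associations[number] in orf_associations
        for number in sorted(numbered_associations)
    ]

# Placeholder data: attribute number -> association, and the associations
# that hold for one hypothetical ORF.
numbered = {1: "assoc_a", 2: "assoc_b", 3: "assoc_c"}
holds_for_orf = {"assoc_a", "assoc_c"}
print(boolean_row(holds_for_orf, numbered))
```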

Boolean (c4.5) data:
The data is available here in C4.5 format. The .names files give the attribute names. The .train, .valid and .propertest files hold the training, validation and test data respectively. The .unknown files hold data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or which had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classification (1 is the most general, 4 is the most specific).

GO including IEA annotations

struc1.names - struc1.train.gz - struc1.valid.gz - struc1.propertest.gz - struc1.unknown.gz

struc2.names - struc2.train.gz - struc2.valid.gz - struc2.propertest.gz - struc2.unknown.gz

struc3.names - struc3.train.gz - struc3.valid.gz - struc3.propertest.gz - struc3.unknown.gz

struc4.names - struc4.train.gz - struc4.valid.gz - struc4.propertest.gz - struc4.unknown.gz

GO excluding IEA annotations

struc1.names - struc1.train.gz - struc1.valid.gz - struc1.propertest.gz - struc1.unknown.gz

struc2.names - struc2.train.gz - struc2.valid.gz - struc2.propertest.gz - struc2.unknown.gz

struc3.names - struc3.train.gz - struc3.valid.gz - struc3.propertest.gz - struc3.unknown.gz

struc4.names - struc4.train.gz - struc4.valid.gz - struc4.propertest.gz - struc4.unknown.gz

MIPS automatic annotations

struc1.names - struc1.train.gz - struc1.valid.gz - struc1.propertest.gz - struc1.unknown.gz

struc2.names - struc2.train.gz - struc2.valid.gz - struc2.propertest.gz - struc2.unknown.gz

struc3.names - struc3.train.gz - struc3.valid.gz - struc3.propertest.gz - struc3.unknown.gz

struc4.names - struc4.train.gz - struc4.valid.gz - struc4.propertest.gz - struc4.unknown.gz

MIPS manual annotations

struc1.names - struc1.train.gz - struc1.valid.gz - struc1.propertest.gz - struc1.unknown.gz

struc2.names - struc2.train.gz - struc2.valid.gz - struc2.propertest.gz - struc2.unknown.gz

struc3.names - struc3.train.gz - struc3.valid.gz - struc3.propertest.gz - struc3.unknown.gz

struc4.names - struc4.train.gz - struc4.valid.gz - struc4.propertest.gz - struc4.unknown.gz

Sequence similarity (Homology)

Sequence similarity (usually implying homology) is detected by a PSI-BLAST (blastpgp) search against NRDB. Associations are constructed from the following set of predicates:

Predicate	Description
eval(Orf, SPID, EVal)	Orf is similar to the SwissProt protein with accession SPID, with e-value EVal.
desc(SPID,X)	SwissProt protein SPID had description word X
db_ref(SPID,X)	SwissProt protein SPID had a database reference to the X database
keyword(SPID,X)	SwissProt protein SPID had keyword X
species(SPID,X)	SwissProt protein SPID belonged to species X
classification(SPID,X)	SwissProt protein SPID belonged to classification X in the species taxonomy
sq_len(SPID,X)	SwissProt protein SPID had sequence length X
mol_wt(SPID,X)	SwissProt protein SPID had molecular weight X

Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not? The list of associations and their corresponding attribute numbers is homAllnondup.nums.
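
As an illustration of what a single boolean attribute might test, the following sketch checks whether an ORF is similar, below some e-value threshold, to a SwissProt protein carrying a given keyword. It combines the eval and keyword predicates only; the data structures, the accession numbers, the threshold and the function name are hypothetical.

```python
def has_association(orf, evals, keywords, max_evalue, keyword):
    """True if orf has a hit with e-value <= max_evalue to a protein with keyword."""
    return any(
        e_value <= max_evalue and keyword in keywords.get(spid, set())
        for spid, e_value in evals.get(orf, [])
    )

# Made-up facts: eval(Orf, SPID, EVal) and keyword(SPID, X).
evals = {"at1g01010": [("P00001", 1e-30), ("P00002", 0.5)]}
keywords = {"P00001": {"transferase"}, "P00002": {"kinase"}}

print(has_association("at1g01010", evals, keywords, 1e-5, "transferase"))
print(has_association("at1g01010", evals, keywords, 1e-5, "kinase"))
```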

Boolean (c4.5) data:
The data is available here in C4.5 format. The .names files give the attribute names. The .train, .valid and .propertest files hold the training, validation and test data respectively. The .unknown files hold data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or which had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classification (1 is the most general, 4 is the most specific).

GO including IEA annotations

hom1.names - hom1.train.gz - hom1.valid.gz - hom1.propertest.gz - hom1.unknown.gz

hom2.names - hom2.train.gz - hom2.valid.gz - hom2.propertest.gz - hom2.unknown.gz

hom3.names - hom3.train.gz - hom3.valid.gz - hom3.propertest.gz - hom3.unknown.gz

hom4.names - hom4.train.gz - hom4.valid.gz - hom4.propertest.gz - hom4.unknown.gz

GO excluding IEA annotations

hom1.names - hom1.train.gz - hom1.valid.gz - hom1.propertest.gz - hom1.unknown.gz

hom2.names - hom2.train.gz - hom2.valid.gz - hom2.propertest.gz - hom2.unknown.gz

hom3.names - hom3.train.gz - hom3.valid.gz - hom3.propertest.gz - hom3.unknown.gz

hom4.names - hom4.train.gz - hom4.valid.gz - hom4.propertest.gz - hom4.unknown.gz

MIPS automatic annotations

hom1.names - hom1.train.gz - hom1.valid.gz - hom1.propertest.gz - hom1.unknown.gz

hom2.names - hom2.train.gz - hom2.valid.gz - hom2.propertest.gz - hom2.unknown.gz

hom3.names - hom3.train.gz - hom3.valid.gz - hom3.propertest.gz - hom3.unknown.gz

hom4.names - hom4.train.gz - hom4.valid.gz - hom4.propertest.gz - hom4.unknown.gz

MIPS manual annotations

hom1.names - hom1.train.gz - hom1.valid.gz - hom1.propertest.gz - hom1.unknown.gz

hom2.names - hom2.train.gz - hom2.valid.gz - hom2.propertest.gz - hom2.unknown.gz

hom3.names - hom3.train.gz - hom3.valid.gz - hom3.propertest.gz - hom3.unknown.gz

hom4.names - hom4.train.gz - hom4.valid.gz - hom4.propertest.gz - hom4.unknown.gz

SCOP

SCOP superfamily predictions, as made by the Superfamily server. Attributes are the classes in the SCOP hierarchy. Values are the e-values of a match to that family. Values of 10 are recorded where there is no match.
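
The attribute encoding described above can be sketched as follows. The superfamily identifiers are placeholders, and keeping the smallest e-value when a superfamily is hit more than once is an assumption; only the default of 10 for "no match" comes from the description.

```python
NO_MATCH = 10.0  # value recorded where there is no match

def scop_row(hits, superfamilies):
    """Build one attribute vector of e-values, one entry per superfamily."""
    best = {}
    for superfamily, e_value in hits:
        # Assumption: keep the best (smallest) e-value per superfamily.
        best[superfamily] = min(e_value, best.get(superfamily, NO_MATCH))
    return [best.get(superfamily, NO_MATCH) for superfamily in superfamilies]

# Placeholder superfamily names and made-up hits for one ORF.
superfamilies = ["sf_a", "sf_b", "sf_c"]
hits = [("sf_b", 1e-12), ("sf_b", 3e-4)]
print(scop_row(hits, superfamilies))
```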

GO including IEA annotations

scop1.names - scop1.train.gz - scop1.valid.gz - scop1.propertest.gz - scop1.unknown.gz

scop2.names - scop2.train.gz - scop2.valid.gz - scop2.propertest.gz - scop2.unknown.gz

scop3.names - scop3.train.gz - scop3.valid.gz - scop3.propertest.gz - scop3.unknown.gz

scop4.names - scop4.train.gz - scop4.valid.gz - scop4.propertest.gz - scop4.unknown.gz

GO excluding IEA annotations

scop1.names - scop1.train.gz - scop1.valid.gz - scop1.propertest.gz - scop1.unknown.gz

scop2.names - scop2.train.gz - scop2.valid.gz - scop2.propertest.gz - scop2.unknown.gz

scop3.names - scop3.train.gz - scop3.valid.gz - scop3.propertest.gz - scop3.unknown.gz

scop4.names - scop4.train.gz - scop4.valid.gz - scop4.propertest.gz - scop4.unknown.gz

MIPS automatic annotations

scop1.names - scop1.train.gz - scop1.valid.gz - scop1.propertest.gz - scop1.unknown.gz

scop2.names - scop2.train.gz - scop2.valid.gz - scop2.propertest.gz - scop2.unknown.gz

scop3.names - scop3.train.gz - scop3.valid.gz - scop3.propertest.gz - scop3.unknown.gz

scop4.names - scop4.train.gz - scop4.valid.gz - scop4.propertest.gz - scop4.unknown.gz

MIPS manual annotations

scop1.names - scop1.train.gz - scop1.valid.gz - scop1.propertest.gz - scop1.unknown.gz

scop2.names - scop2.train.gz - scop2.valid.gz - scop2.propertest.gz - scop2.unknown.gz

scop3.names - scop3.train.gz - scop3.valid.gz - scop3.propertest.gz - scop3.unknown.gz

scop4.names - scop4.train.gz - scop4.valid.gz - scop4.propertest.gz - scop4.unknown.gz

InterPro

This data was derived using InterProScan.

The mapping from associations to the corresponding attribute numbers is in interprotrain.out.s100.d4.namesmap

GO including IEA annotations

interpro1.names - interpro1.train.gz - interpro1.valid.gz - interpro1.propertest.gz - interpro1.unknown.gz

interpro2.names - interpro2.train.gz - interpro2.valid.gz - interpro2.propertest.gz - interpro2.unknown.gz

interpro3.names - interpro3.train.gz - interpro3.valid.gz - interpro3.propertest.gz - interpro3.unknown.gz

interpro4.names - interpro4.train.gz - interpro4.valid.gz - interpro4.propertest.gz - interpro4.unknown.gz

GO excluding IEA annotations

interpro1.names - interpro1.train.gz - interpro1.valid.gz - interpro1.propertest.gz - interpro1.unknown.gz

interpro2.names - interpro2.train.gz - interpro2.valid.gz - interpro2.propertest.gz - interpro2.unknown.gz

interpro3.names - interpro3.train.gz - interpro3.valid.gz - interpro3.propertest.gz - interpro3.unknown.gz

interpro4.names - interpro4.train.gz - interpro4.valid.gz - interpro4.propertest.gz - interpro4.unknown.gz

MIPS automatic annotations

interpro1.names - interpro1.train.gz - interpro1.valid.gz - interpro1.propertest.gz - interpro1.unknown.gz

interpro2.names - interpro2.train.gz - interpro2.valid.gz - interpro2.propertest.gz - interpro2.unknown.gz

interpro3.names - interpro3.train.gz - interpro3.valid.gz - interpro3.propertest.gz - interpro3.unknown.gz

interpro4.names - interpro4.train.gz - interpro4.valid.gz - interpro4.propertest.gz - interpro4.unknown.gz

MIPS manual annotations

interpro1.names - interpro1.train.gz - interpro1.valid.gz - interpro1.propertest.gz - interpro1.unknown.gz

interpro2.names - interpro2.train.gz - interpro2.valid.gz - interpro2.propertest.gz - interpro2.unknown.gz

interpro3.names - interpro3.train.gz - interpro3.valid.gz - interpro3.propertest.gz - interpro3.unknown.gz

interpro4.names - interpro4.train.gz - interpro4.valid.gz - interpro4.propertest.gz - interpro4.unknown.gz

Expression

This is a subset of the microarray data from NASC: the results of 43 experiments, taken from CDs issued between Dec 2002 and Jan 2004, using signal, detection call and detection P-values.