Annotation

Terms and Annotations

Systematic Name ... (or Ensembl ID, Locus ID, or ORF name) (From SGD's Glossary of Terms)
All S. cerevisiae ORFs are designated by a symbol consisting of three uppercase letters followed by a number and then another letter, as follows: Y (for "Yeast"); A - P for the chromosome upon which the ORF resides (where "A" is chromosome I, up to "P" for chromosome XVI); L or R (for Left or Right arm); a 3-digit number corresponding to the order of the open reading frame on the chromosome arm (starting from the centromere and counting out to the telomere); and W or C for whether the open reading frame is on the "Watson" or "Crick" strand (where "Watson" runs 5' to 3' from left telomere to right telomere). Most ORF designations by the systematic sequencing groups use a predicted 100 amino acid polypeptide as the minimum size limit, except when a smaller gene has already been characterized and localized to the chromosomal sequence. When a new ORF is discovered on a chromosome that has already had its ORFs named, the new ORF will usually be named by taking the name of an adjacent ORF and adding an "A" or "B" to the end of it (this avoids re-numbering all the distal ORFs).
Ex - YPR166C: Yeast, chromosome P (XVI), Right arm, ORF number 166 counting out from the centromere, on the Crick strand

Entrez Gene ID (From Gene FAQ on NCBI's website)
The GeneID integer is the same as the LocusID seen in LocusLink. We plan to provide the GeneID equal to the LocusID as long as LocusLink continues to be offered.

Entrez Gene ID and GO Terms (From Gene FAQ on NCBI's website)
NCBI reports GO terms appropriate for a GeneID by integrating information from the following sources:
For all genomes but human, a species-specific gene identifier (FBgn ID, MGI ID, RGD ID) is converted to the GeneID. For human, the connection is made from common protein accessions. Most current gaps in the human set, therefore, result from lags in matching protein accessions to GeneIDs. According to Gene's current data flow, any association of a protein accession with more than one gene record must be reviewed by a curator. This multiplicity can occur frequently with gene families where multiple genes encode the same protein sequence. Entrez Gene currently reports, and uses for indexed queries, only the explicit GO term or terms assigned to any gene. It does not support querying at any node of the GO graph, nor retrieving all genes that match terms at more specific nodes based on a query at a higher node.

IntAct "EBI-" IDs
These are internal IDs which have a direct mapping to UniProtKB IDs.

Mapping Between Annotation Types
Our problem consists of data in three parts: lab data attached to Systematic Names, Gene Ontology information attached to its own internal IDs, and graphs from the IntAct database that use IntAct's internal IDs. Our goal is to take the GO information and the lab data and use them on the graphs. We will use the Hazbun dataset in the IntAct database as our graph for this particular tutorial. Using the Rintact library, which may be replaced with a more robust PSI2.5 reader in the future, we can extract the ID information present in the XML file as follows:
> library(Rintact)
> int1 = psi25interaction("14690591_hazbun-2003-1_01.xml")
> int1 = interactors(int1)
> summary(int1)
uniprotId      geneName
O13297 :   1          : 401
O13329 :   1   AAC2   :   1
O13520 :   1   AAC3   :   1
O13532 :   1   AAD4   :   1
O13539 :   1   AAD6   :   1
O13540 :   1   (Other):1226
(Other):1636   NA's   :  11
fullName
Ubiquitin-conjugating enzyme E2-18 kDa : 2
(R,R)-butanediol dehydrogenase : 1
1,3-beta-glucan synthase component FKS1 : 1
1,3-beta-glucan synthase component FKS3 : 1
1-(5-phosphoribosyl)-5-[(5-phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase: 1
13 kDa ribonucleoprotein-associated protein : 1
(Other) :1635
locusName      orfName        organismName                    taxId
YAL005C:   1   L9470.14:  2   Saccharomyces cerevisiae:1642   4932:1642
YAL007C:   1   00953   :  1
YAL008W:   1   06105   :  1
YAL011W:   1   06139   :  1
YAL014C:   1   06163   :  1
YAL016W:   1   (Other) :920
(Other):1636   NA's    :716
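Columns like these can be checked against the SGD naming convention quoted earlier with a small pattern match. The following is only a sketch in base R: `sgd_pattern` and `parse_systematic_name` are hypothetical names introduced for illustration (they are not part of Rintact or SGD), and the optional hyphen before the trailing "A" or "B" suffix is an assumption about how later-added ORFs are written.

```r
# Pattern for the convention described in the SGD glossary text:
#   Y, chromosome letter A-P, arm L or R, 3-digit number, strand W or C,
#   plus an optional suffix letter for ORFs added later (the hyphen is an assumption).
sgd_pattern <- "^Y([A-P])([LR])([0-9]{3})([WC])(-?[A-Z])?$"

# Hypothetical helper: split each ID into its parts, or flag it as not matching.
parse_systematic_name <- function(ids) {
  ids <- as.character(ids)
  matches <- regmatches(ids, regexec(sgd_pattern, ids))
  rows <- lapply(matches, function(x) {
    if (length(x) == 0) {
      # No match: the ID does not follow the systematic-name pattern.
      return(data.frame(valid = FALSE, chromosome = NA_integer_,
                        arm = NA_character_, number = NA_integer_,
                        strand = NA_character_, stringsAsFactors = FALSE))
    }
    data.frame(valid = TRUE,
               chromosome = match(x[2], LETTERS),  # "A" -> 1 ... "P" -> 16
               arm = x[3],
               number = as.integer(x[4]),
               strand = x[5],
               stringsAsFactors = FALSE)
  })
  data.frame(id = ids, do.call(rbind, rows), stringsAsFactors = FALSE)
}

# IDs taken from the text and from the summary output above.
parsed <- parse_systematic_name(c("YPR166C", "YBL048W", "L9470.14", "00953"))
print(parsed)
```

Applied to a whole column, for example `parse_systematic_name(int1[,"orfName"])$valid`, this would give a quick count of how many entries follow the convention, assuming the column contains no missing values.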
If we look at this summary closely, we can already see how confusing this annotation task can be. The "orfName" column clearly contains IDs that do not follow SGD's description of ORF names. Instead, the "locusName" column contains the IDs we expected to see. Despite this, we do seem to have the information we need: we are able to tie the internal IntAct IDs to both UniProt IDs and ORF names.
> int1["EBI-21252","locusName"]
[1] "YBL048W"
> int1["EBI-21252","uniprotId"]
[1] "P38192"
We can make the jump from UniProt ID to GO term using UniProt's Mapping Tool, by first converting the UniProt ID to an Entrez Gene ID, which can then be mapped directly to GO terms using the "GO" library from Bioconductor. (From "library(help=GO)")
GOENTREZID2GO    Entrez Gene to Gene Ontology (GO) mapping

This multi-step mapping process introduces a new set of errors, in addition to any errors inherently present in the data. Not all terms will map. Some terms will map multiply. For example, the UniProt ID "P38192", which we mapped from "EBI-21252", fails right away to map to an Entrez Gene ID. Using
a = int1[,1]
names(a) = NULL
write.table(a,file="uniProtList1", col.names=FALSE, row.names=FALSE)
we get a list of the IDs which need mappings. If we paste this entire list into the mapping tool, we get the following disheartening message:
"1,642 out of 1,642 identifiers mapped to 0 identifiers in the target data set"
If we then strip the quotation marks out of the file (write.table adds them by default; passing quote=FALSE avoids them) and try again, we are greeted with a much nicer:
"1,642 out of 1,642 identifiers mapped to 1,510 identifiers in the target data set"
This mapping can then be downloaded and used in R to map from one ID to the next. The Entrez IDs can be used to search for GO terms directly. We did lose more than 100 IDs in this process, IDs which, with a little hand searching, can be mapped successfully to GO terms. This is an issue I do not know how to resolve at this time. We can map the Entrez ID using the previously mentioned structure:
> names(GOENTREZID2GO[["855036"]])
[1] "GO:0004842" "GO:0004842" "GO:0005783" "GO:0006333" "GO:0030433" "GO:0030433" "GO:0030433"
If we cross-check this with an Ensembl search, we see that:
EntrezGene: UBC7, QRI8, 855036
GO: GO:0005783, GO:0006333, GO:0046870, GO:0019787, GO:0030433, GO:0006512, GO:0016874, GO:0004842, GO:0046686, GO:0006464, GO:0005515
The 4 unique terms that are found all show up in the more general list from the Ensembl search. The duplication of terms results from the multiple detection techniques (GO evidence codes, such as IGI and IMP below) listed for the mapping:
$`GO:0030433`
$`GO:0030433`$GOID
IGI
"GO:0030433"
$`GO:0030433`$Evidence
[1] "IGI"
$`GO:0030433`$Ontology
[1] "BP"
$`GO:0030433`
$`GO:0030433`$GOID
IMP
"GO:0030433"
$`GO:0030433`$Evidence
[1] "IMP"
$`GO:0030433`$Ontology
[1] "BP"
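Because each evidence code produces its own entry, it can help to collapse an entry list to one row per GO term. The sketch below is one way to do this in base R; `collapse_go_entries` is a hypothetical helper introduced here, and `entry` is a hand-built stand-in that copies only the two GO:0030433 records printed above rather than a real lookup in GOENTREZID2GO.

```r
# Stand-in for part of GOENTREZID2GO[["855036"]], copied from the printed output.
entry <- list(
  "GO:0030433" = list(GOID = c(IGI = "GO:0030433"), Evidence = "IGI", Ontology = "BP"),
  "GO:0030433" = list(GOID = c(IMP = "GO:0030433"), Evidence = "IMP", Ontology = "BP")
)

# Hypothetical helper: one row per unique GO ID, with its evidence codes joined.
collapse_go_entries <- function(entry) {
  goid     <- unname(vapply(entry, function(x) unname(x$GOID), character(1)))
  evidence <- unname(vapply(entry, function(x) x$Evidence, character(1)))
  ontology <- unname(vapply(entry, function(x) x$Ontology, character(1)))
  ids <- unique(goid)
  data.frame(
    GOID = ids,
    Ontology = ontology[match(ids, goid)],
    Evidence = unname(vapply(
      ids,
      function(i) paste(sort(unique(evidence[goid == i])), collapse = ","),
      character(1)
    )),
    stringsAsFactors = FALSE
  )
}

collapsed <- collapse_go_entries(entry)
print(collapsed)
```

With the GO library loaded, the same helper could presumably be applied to `GOENTREZID2GO[["855036"]]` directly, assuming every record has the GOID, Evidence, and Ontology fields shown above.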
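The export-and-map step described earlier can also be sketched end to end. This example writes the IDs without quotation marks and then lists which IDs are absent from a downloaded mapping. It is an illustration under stated assumptions: the `mapping` data frame, its "From" and "To" column names, and the target values are placeholders invented here, not real Entrez Gene IDs or the mapping tool's actual output format.

```r
# UniProt IDs taken from the summary and text above.
ids <- c("O13297", "O13329", "P38192")

# quote = FALSE keeps write.table from wrapping each ID in double quotes,
# so the file can be pasted into the mapping tool as is.
id_file <- tempfile(fileext = ".txt")
write.table(ids, file = id_file, quote = FALSE,
            col.names = FALSE, row.names = FALSE)
written <- readLines(id_file)

# Placeholder for the downloaded mapping table (hypothetical columns and values).
mapping <- data.frame(
  From = c("O13297", "O13329"),
  To   = c("placeholder_1", "placeholder_2"),
  stringsAsFactors = FALSE
)

# IDs that were submitted but did not come back with a mapping.
unmapped <- setdiff(ids, mapping$From)
print(unmapped)
```

Keeping the list of unmapped IDs makes it easier to follow up by hand on the IDs lost in the automatic mapping.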