SPARQL / RDF

STRING provides a Resource Description Framework (RDF) data model and a SPARQL endpoint, enabling you to access our protein-protein interaction data without using the traditional web page interface. One of the key benefits of the SPARQL endpoint over the API is its capacity to handle "federated queries". Federated queries let you query and integrate data across multiple databases at once, removing the need to centralize the data or to write a separate parser for each resource's API. For instance, with a single query, you can combine data about protein features from UniProt with data about interaction partners from STRING.

Note: This is not a stable release of the STRING SPARQL endpoint and the RDF data. Please contact us if you have any questions, feedback or requests.

RDF STRING data representation

In our effort to optimize the performance of the SPARQL endpoint, we've introduced a simplified RDF representation of STRING's data structure. In this representation, interactions are categorized based on the network type (functional or physical) and confidence thresholds. This system allows users to efficiently retrieve interactions of their chosen type and confidence level. We've achieved this categorization by encoding this data directly into the predicate, rather than as an attribute of the interaction. For performance and simplicity, we've also selected specific, widely-used name-spaces instead of using all known aliases, a distinction from the approach taken by the API.

It's crucial to highlight that the STRING API provides a wider array of functionalities. Unless your focus is on federated queries, semantic web applications, or other specific use cases where RDF excels, the STRING API would often be your preferred choice.

SPARQL endpoint

You can access the query editor and the endpoint itself (for your programmatic queries) under the following address:

https://sparql.string-db.org

The underlying RDF data is currently only retrievable through the SPARQL endpoint. If you would like a dump of the RDF data structure, please contact us.
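For programmatic access, a minimal Python sketch could look like the one below. It assumes the endpoint follows the standard SPARQL 1.1 Protocol (a GET request with a "query" parameter) at the address given above and can return JSON results; the exact path may differ, and the function names are only illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the SPARQL service answers at this address; the exact path
# (for example a trailing /sparql) may differ, so verify it before use.
ENDPOINT = "https://sparql.string-db.org"


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query over the network and return the list of result rows."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=60) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling run_query with any of the queries from the tutorial section would then return one dictionary per result row, provided the assumptions above hold.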

RDF Structure

The STRING database uses an RDF data structure to represent protein interactions in a streamlined fashion. Within this structure, each interaction is represented as a triple, commonly referred to as subject-predicate-object. In this context, both the subject and the object denote the proteins involved in the interaction. To clarify, if proteins A and B interact, this is represented twice: once with A as the subject and B as the object, and once with B as the subject and A as the object. Because each interaction is stored in both directions, querying for a particular protein in just one position (for example, as the subject) retrieves all of its interactions for a given predicate.
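The effect of storing each interaction in both directions can be illustrated with a small in-memory sketch. The protein names "A", "B", "C" and the predicate string are placeholders for illustration only, not real STRING identifiers.

```python
# Placeholder identifiers and predicate name, used only for illustration.
def symmetric_triples(protein_a, protein_b, predicate):
    """Return the two triples that encode one undirected interaction."""
    return [(protein_a, predicate, protein_b), (protein_b, predicate, protein_a)]


def partners(triples, protein, predicate):
    """Collect partners by matching the protein in the subject position only."""
    return sorted(o for s, p, o in triples if s == protein and p == predicate)


triples = symmetric_triples("A", "B", "if_any") + symmetric_triples("C", "A", "if_any")
print(partners(triples, "A", "if_any"))  # ['B', 'C']
```

Even though the second interaction was entered as C-A, matching A as the subject alone still finds both partners.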

The encoded interaction can have several distinct predicates, determined by both the network type and confidence level. There are two primary network classifications: functional and physical. Along with this, there are four confidence thresholds: 'highest', 'high', 'medium', and 'any'. This results in a set of eight different predicates, each a combination of a network type and confidence cut-off (for details see section 'interaction predicates'). This straightforward predicate structure reduces the complexity of your queries and accelerates the data retrieval.

Objects

description	prefix	URI
protein	protein:	http://string-db.org/network/
taxon	taxon:	http://identifiers.org/taxonomy/
UniProt accession	uniprotkb:	http://purl.uniprot.org/uniprot/
Entrez GeneId	geneid:	http://www.ncbi.nlm.nih.gov/gene/
RefSeq protein	refseq:	http://www.ncbi.nlm.nih.gov/protein/
Ensembl protein	ensembl:	http://rdf.ebi.ac.uk/resource/ensembl.protein/

Protein predicates

description	prefix	URI
protein symbol	rdfs:label	http://www.w3.org/2000/01/rdf-schema#label
protein description	rdfs:comment	http://www.w3.org/2000/01/rdf-schema#comment
STRING organism	organism:	http://string-db.org/rdf/organism/
cross-reference to another name-space	rdfs:seeAlso	http://www.w3.org/2000/01/rdf-schema#seeAlso

Interaction predicates

network type	score threshold	prefix	URI
functional	>= 0.900	if_highest:	http://string-db.org/rdf/interaction/functional-highest-confidence-cutoff
functional	>= 0.700	if_high:	http://string-db.org/rdf/interaction/functional-high-confidence-cutoff
functional	>= 0.400	if_medium:	http://string-db.org/rdf/interaction/functional-medium-confidence-cutoff
functional	>= 0.000*	if_any:	http://string-db.org/rdf/interaction/functional-any-confidence-cutoff
physical	>= 0.900	ip_highest:	http://string-db.org/rdf/interaction/physical-highest-confidence-cutoff
physical	>= 0.700	ip_high:	http://string-db.org/rdf/interaction/physical-high-confidence-cutoff
physical	>= 0.400	ip_medium:	http://string-db.org/rdf/interaction/physical-medium-confidence-cutoff
physical	>= 0.000*	ip_any:	http://string-db.org/rdf/interaction/physical-any-confidence-cutoff

*database stores only interactions with the combined score >= 0.150
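
Because the confidence level is part of the predicate, choosing a score cut-off means choosing a predicate. The sketch below builds the predicate URI from the table above and picks the most inclusive level that still guarantees a requested minimum score; the function names are hypothetical helpers, not part of STRING.

```python
BASE = "http://string-db.org/rdf/interaction/"
# Score cut-off of each confidence level, taken from the table above.
THRESHOLDS = {"any": 0.0, "medium": 0.4, "high": 0.7, "highest": 0.9}


def predicate_iri(network, confidence):
    """Return the full predicate URI for a network type and confidence level."""
    if network not in ("functional", "physical"):
        raise ValueError("network must be 'functional' or 'physical'")
    if confidence not in THRESHOLDS:
        raise ValueError("unknown confidence level: %r" % confidence)
    return f"{BASE}{network}-{confidence}-confidence-cutoff"


def level_for(min_score):
    """Return the most inclusive level whose cut-off is still >= min_score."""
    for level, cutoff in sorted(THRESHOLDS.items(), key=lambda kv: kv[1]):
        if cutoff >= min_score:
            return level
    raise ValueError("no predicate guarantees scores >= %s" % min_score)


print(predicate_iri("physical", level_for(0.75)))
```

A request for scores of at least 0.75 falls between two levels, so the sketch selects 'highest'; interactions scoring between 0.75 and 0.9 are not retrievable with that guarantee in this simplified representation.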

All namespaces and prefixes

# Protein

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>  
PREFIX organism: <http://string-db.org/rdf/organism/> 
PREFIX taxon: <http://identifiers.org/taxonomy/> 
PREFIX protein: <http://string-db.org/network/>  

# Identifiers

PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX geneid: <http://www.ncbi.nlm.nih.gov/gene/> 
PREFIX refseq: <http://www.ncbi.nlm.nih.gov/protein/> 
PREFIX ensembl: <http://rdf.ebi.ac.uk/resource/ensembl.protein/> 

# Functional interactions

PREFIX if_highest: <http://string-db.org/rdf/interaction/functional-highest-confidence-cutoff> 
PREFIX if_high: <http://string-db.org/rdf/interaction/functional-high-confidence-cutoff> 
PREFIX if_medium: <http://string-db.org/rdf/interaction/functional-medium-confidence-cutoff>  
PREFIX if_any: <http://string-db.org/rdf/interaction/functional-any-confidence-cutoff> 

# Physical interactions

PREFIX ip_highest: <http://string-db.org/rdf/interaction/physical-highest-confidence-cutoff>
PREFIX ip_high: <http://string-db.org/rdf/interaction/physical-high-confidence-cutoff> 
PREFIX ip_medium: <http://string-db.org/rdf/interaction/physical-medium-confidence-cutoff>
PREFIX ip_any: <http://string-db.org/rdf/interaction/physical-any-confidence-cutoff>
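
When queries are assembled in a script, the declarations above can be kept in one place and only the ones a query actually uses prepended to it. The following is a rough sketch of that idea; with_prefixes is a hypothetical helper, and its prefix detection is a simple heuristic rather than a full SPARQL parser.

```python
import re

# The prefixes listed above, in the same order.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "organism": "http://string-db.org/rdf/organism/",
    "taxon": "http://identifiers.org/taxonomy/",
    "protein": "http://string-db.org/network/",
    "uniprotkb": "http://purl.uniprot.org/uniprot/",
    "geneid": "http://www.ncbi.nlm.nih.gov/gene/",
    "refseq": "http://www.ncbi.nlm.nih.gov/protein/",
    "ensembl": "http://rdf.ebi.ac.uk/resource/ensembl.protein/",
    "if_highest": "http://string-db.org/rdf/interaction/functional-highest-confidence-cutoff",
    "if_high": "http://string-db.org/rdf/interaction/functional-high-confidence-cutoff",
    "if_medium": "http://string-db.org/rdf/interaction/functional-medium-confidence-cutoff",
    "if_any": "http://string-db.org/rdf/interaction/functional-any-confidence-cutoff",
    "ip_highest": "http://string-db.org/rdf/interaction/physical-highest-confidence-cutoff",
    "ip_high": "http://string-db.org/rdf/interaction/physical-high-confidence-cutoff",
    "ip_medium": "http://string-db.org/rdf/interaction/physical-medium-confidence-cutoff",
    "ip_any": "http://string-db.org/rdf/interaction/physical-any-confidence-cutoff",
}


def with_prefixes(body):
    """Prepend PREFIX lines for every known prefix that appears in the body."""
    # Drop <...> IRIs and "..." literals so "http:" is not mistaken for a prefix.
    stripped = re.sub(r'<[^>]*>|"[^"]*"', "", body)
    used = set(re.findall(r"([A-Za-z_]\w*):", stripped))
    header = "\n".join(
        f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items() if name in used
    )
    return header + "\n\n" + body


print(with_prefixes("SELECT ?partner WHERE { protein:511145.b1260 if_any: ?partner }"))
```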

Query Tutorial

Now that we know the basic RDF structure, we can construct queries and execute them interactively in the query editor at https://sparql.string-db.org. Simply copy and paste the queries below into the input box and click "Execute Query".

Let's start with something simple.

1. Retrieve all functional interaction partners for STRING protein '511145.b1260' at any confidence level.

PREFIX protein: <http://string-db.org/network/>  
PREFIX if_any: <http://string-db.org/rdf/interaction/functional-any-confidence-cutoff> 

SELECT ?partner 
WHERE { 
    protein:511145.b1260 if_any: ?partner 
}

For this query we need to know the STRING protein identifier in advance, so let's try querying with a gene symbol instead.

2. Retrieve all physical interaction partners for protein with gene symbol 'CDK2' at high confidence level.

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>  
PREFIX protein: <http://string-db.org/network/>  
PREFIX ip_high: <http://string-db.org/rdf/interaction/physical-high-confidence-cutoff> 

SELECT ?partner WHERE {
     ?protein rdfs:label "CDK2" . 
     ?protein ip_high: ?partner 
}

Notice that we got proteins from multiple species back (the species is indicated by the taxon number at the start of each protein identifier). This is because there are multiple species in STRING with the symbol 'CDK2'. We should narrow down the search to a specific species, let's say human (9606).

2b. Retrieve all physical interaction partners for human protein 'CDK2' at high confidence level.

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>  
PREFIX organism: <http://string-db.org/rdf/organism/> 
PREFIX taxon: <http://identifiers.org/taxonomy/> 
PREFIX protein: <http://string-db.org/network/>  
PREFIX ip_high: <http://string-db.org/rdf/interaction/physical-high-confidence-cutoff> 

SELECT ?partner WHERE {
     ?protein organism: taxon:9606 .
     ?protein rdfs:label "CDK2" . 
     ?protein ip_high: ?partner 
}
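
If you generate this kind of query in a script, the symbol, species and predicate can be turned into parameters. The sketch below shows one possible way to do that in Python; partners_query is a hypothetical helper, and it writes the predicate as a full URI in angle brackets instead of declaring a prefix for it.

```python
def partners_query(symbol, taxon_id, network="physical", confidence="high"):
    """Build the species-restricted partner query for one gene symbol."""
    if network not in ("functional", "physical"):
        raise ValueError("network must be 'functional' or 'physical'")
    if confidence not in ("highest", "high", "medium", "any"):
        raise ValueError("unknown confidence level: %r" % confidence)
    if "\n" in symbol or "\r" in symbol:
        raise ValueError("symbol must be a single line")
    # Escape backslashes and double quotes so the symbol stays inside the literal.
    literal = symbol.replace("\\", "\\\\").replace('"', '\\"')
    predicate = (
        "http://string-db.org/rdf/interaction/"
        f"{network}-{confidence}-confidence-cutoff"
    )
    return "\n".join([
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX organism: <http://string-db.org/rdf/organism/>",
        "PREFIX taxon: <http://identifiers.org/taxonomy/>",
        "",
        "SELECT ?partner WHERE {",
        f"    ?protein organism: taxon:{int(taxon_id)} .",
        f'    ?protein rdfs:label "{literal}" .',
        f"    ?protein <{predicate}> ?partner",
        "}",
    ])


print(partners_query("CDK2", 9606))
```

With the default arguments this prints a query equivalent to the one above; passing a different taxon number or confidence level changes only the corresponding line.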

That is better, but we still get hard-to-read STRING protein identifiers in return. Let's say we also want to get the gene symbol of each partner. In addition, we will change our threshold to the highest confidence level.

3. Retrieve gene symbols of all physical interaction partners for human protein 'CDK2' at highest confidence level.

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX organism: <http://string-db.org/rdf/organism/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX protein: <http://string-db.org/network/>
PREFIX ip_highest: <http://string-db.org/rdf/interaction/physical-highest-confidence-cutoff>

SELECT ?partner ?partnerLabel WHERE {
     ?protein organism: taxon:9606 .
     ?protein rdfs:label "CDK2" .
     ?protein ip_highest: ?partner .
     ?partner rdfs:label ?partnerLabel
}

That seems to work well. However, when working with different databases you will want to use a common name-space. A good choice here is the UniProt accession. CDK2 has the UniProt accession 'P24941', so let's query with that and ask for UniProt accessions in return, instead of STRING identifiers.

4. Retrieve gene symbols and UniProt AC of all physical interaction partners for UniProt AC 'P24941' at highest confidence level.

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX organism: <http://string-db.org/rdf/organism/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX protein: <http://string-db.org/network/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX ip_highest: <http://string-db.org/rdf/interaction/physical-highest-confidence-cutoff>

SELECT ?partnerUP ?partnerLabel WHERE {
     ?protein organism: taxon:9606 .
     ?protein rdfs:seeAlso uniprotkb:P24941 .
     ?protein ip_highest: ?partner .
     ?partner rdfs:label ?partnerLabel .
     ?partner rdfs:seeAlso ?partnerUP .
     FILTER(STRSTARTS(str(?partnerUP), str(uniprotkb:)))
}
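
When this query is run programmatically, the partners come back as full UniProt URIs. The sketch below shows one way to reduce them to plain accessions, assuming the results are delivered in the standard SPARQL 1.1 JSON results format. The SAMPLE row is made up to show the shape of the data and is not actual query output.

```python
import json

UNIPROT = "http://purl.uniprot.org/uniprot/"

# Made-up example row in SPARQL JSON results format, for illustration only.
SAMPLE = json.dumps({
    "head": {"vars": ["partnerUP", "partnerLabel"]},
    "results": {"bindings": [{
        "partnerUP": {"type": "uri", "value": UNIPROT + "P24941"},
        "partnerLabel": {"type": "literal", "value": "CDK2"},
    }]},
})


def accessions(results_json):
    """Return (accession, label) pairs for every row with a UniProt URI."""
    rows = json.loads(results_json)["results"]["bindings"]
    pairs = []
    for row in rows:
        iri = row["partnerUP"]["value"]
        if iri.startswith(UNIPROT):
            pairs.append((iri[len(UNIPROT):], row["partnerLabel"]["value"]))
    return pairs


print(accessions(SAMPLE))  # [('P24941', 'CDK2')]
```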

Another helpful query lists all STRING proteins in a given organism.

5. Retrieve all E. coli K-12 proteins, their associated gene symbols and descriptions.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX organism: <http://string-db.org/rdf/organism/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX protein: <http://string-db.org/network/>

SELECT ?protein ?geneLabel ?geneDescription WHERE {
    ?protein organism: taxon:511145 .
    ?protein rdfs:label ?geneLabel .
    ?protein rdfs:comment ?geneDescription
}

Also something more complex:

6. Retrieve all D. melanogaster proteins that functionally interact with both smoothened and hedgehog at highest confidence level.

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX organism: <http://string-db.org/rdf/organism/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX protein: <http://string-db.org/network/>
PREFIX if_highest: <http://string-db.org/rdf/interaction/functional-highest-confidence-cutoff>

SELECT ?partner ?partnerLabel WHERE {
    ?protein1 organism: taxon:7227 .
    ?protein1 rdfs:label "smo" .
    ?protein1 if_highest: ?partner .

    ?protein2 organism: taxon:7227 .
    ?protein2 rdfs:label "hh" .
    ?protein2 if_highest: ?partner .

    ?partner rdfs:label ?partnerLabel
}
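
The federated queries mentioned in the introduction can build on query 4, because the partners are identified there by UniProt URIs. The sketch below assembles such a query in Python. It rests on several assumptions that you should verify: that the STRING endpoint permits outgoing SERVICE calls, that UniProt's public SPARQL endpoint is reachable at https://sparql.uniprot.org/sparql, and that the up:mnemonic property from the UniProt core vocabulary is what you want to fetch. federated_query is a hypothetical helper name.

```python
import re

# Assumption: the UniProt endpoint address and the up:mnemonic property are
# appropriate here; adapt the SERVICE block to the UniProt data you need.
QUERY_TEMPLATE = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX ip_highest: <http://string-db.org/rdf/interaction/physical-highest-confidence-cutoff>

SELECT ?partnerUP ?partnerLabel ?mnemonic WHERE {
    ?protein rdfs:seeAlso uniprotkb:{accession} .
    ?protein ip_highest: ?partner .
    ?partner rdfs:label ?partnerLabel .
    ?partner rdfs:seeAlso ?partnerUP .
    FILTER(STRSTARTS(STR(?partnerUP), STR(uniprotkb:)))
    SERVICE <https://sparql.uniprot.org/sparql> {
        ?partnerUP up:mnemonic ?mnemonic .
    }
}
"""


def federated_query(accession):
    """Fill a UniProt accession into the federated query template."""
    # Only allow plain alphanumeric accessions to keep the query well-formed.
    if not re.fullmatch(r"[A-Z0-9]+", accession):
        raise ValueError("unexpected accession format: %r" % accession)
    return QUERY_TEMPLATE.replace("{accession}", accession)


print(federated_query("P24941"))
```

The STRING part of the query is the same as in query 4; only the SERVICE block is evaluated remotely, once the partner URIs are known.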