How are the scores computed?

The combined score is computed by combining the probabilities from the different evidence channels and correcting for the probability of randomly observing an interaction. For a more detailed description, please see von Mering et al., Nucleic Acids Res. 2005.

The combined score integrates the probabilities of the individual channels. Each channel score already includes a 'prior' that accounts for the probability that two randomly picked proteins are interacting. Before combining the channels, this prior has to be removed from each channel score; it is then added back once to the combined score. Here is how the combined score is computed for an interaction.

  1. For each of the scores for the individual channels (s_i), remove the prior (p=0.041):

    s_i_noprior = (s_i - p) / (1 - p)
    
  2. Combine the scores of the channels by multiplying the complements (1 - s_i_noprior) over all channels and subtracting the product from 1:

    s_total_noprior = 1 - (1 - s_1_noprior) * (1 - s_2_noprior) * ...
    
  3. Add the prior back (once):

    s_total = s_total_noprior + p * (1 - s_total_noprior)
    

In code it would be something like this:

s_exp = 0.621
s_txt = 0.585
p = 0.041
s_exp_nop = (s_exp - p) / (1 - p)
s_txt_nop = (s_txt - p) / (1 - p)
s_tot_nop = 1 - (1 - s_exp_nop) * (1 - s_txt_nop)
s_tot = s_tot_nop + p * (1 - s_tot_nop)
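
For more than two channels, the same calculation could be wrapped in a small function. This is only a sketch: the name combine_scores is made up for illustration, and clamping channel scores that fall below the prior to zero is an assumption added here as a guard, not something stated above.

```python
PRIOR = 0.041  # prior from the steps above


def combine_scores(channel_scores, prior=PRIOR):
    """Combine per-channel scores: remove the prior, combine, add it back once."""
    prob_no_support = 1.0
    for s in channel_scores:
        s_noprior = (s - prior) / (1 - prior)
        # Assumption: a score below the prior is treated as zero evidence.
        s_noprior = max(s_noprior, 0.0)
        prob_no_support *= 1 - s_noprior
    s_total_noprior = 1 - prob_no_support
    return s_total_noprior + prior * (1 - s_total_noprior)


# Same numbers as the example above (experiments and text mining).
print(round(combine_scores([0.621, 0.585]), 3))  # 0.836
```

A single channel passes through unchanged, since removing and re-adding the prior cancel out.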

Also, a homology correction is applied to the co-occurrence and text-mining scores:

effective co-occurrence score = co-occurrence score * (1 - homology score)

effective text-mining score = text-mining score * (1 - homology score)

How can I obtain the complete data set?

STRING is available for licensing, both for commercial and for academic institutions. For academic use, sign the academic license agreement, send it by regular mail, wait for approval, and then download the SQL database.

I am interested in retrieving data for a few particular interactions for my script. How do I go about getting it?

Please use the API. If you plan to submit thousands of HTTP requests, first make sure that the information you are seeking is not available for download. If you would still like to use the API, please pause for at least one second between API calls. Too many concurrent calls may slow down the server for all users.
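
A script that respects the one-second pause might look roughly like the sketch below. The endpoint is the "resolve" URL used elsewhere in this FAQ and may differ in your STRING version; the function names fetch_text and resolve_politely are made up for illustration.

```python
import time
import urllib.parse
import urllib.request

# URL as shown elsewhere in this FAQ; treat it as an assumption for your version.
API_URL = "http://string-db.org/api/tsv/resolve"


def fetch_text(url):
    """Download one URL and return the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def resolve_politely(identifiers, species, fetch=fetch_text, pause=1.0, sleep=time.sleep):
    """Make one API call per identifier, sequentially, pausing between calls."""
    results = {}
    for i, identifier in enumerate(identifiers):
        if i > 0:
            sleep(pause)  # at least one second between consecutive calls
        query = urllib.parse.urlencode({"identifier": identifier, "species": species})
        results[identifier] = fetch(API_URL + "?" + query)
    return results


# Example (performs real HTTP requests):
# tables = resolve_politely(["trpA", "trpB"], 511145)
```

The calls are made one after another rather than in parallel, so the server never sees concurrent requests from the script.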

How can I save a certain network?

You can find your network available for download under the "Tables/Exports" tab. The network can be downloaded in a variety of formats: Bitmap Image, Scalable Vector Graphics, XML Summary (Proteomics Standards Initiative), Graph Layout, Protein sequences in FASTA format, and Text Summary of interaction scores.

For my latest manuscript, I would like to use a picture in SVG format produced by STRING. Must I ask for permission?

Nope, but we would appreciate it if you cite us.

How to cite STRING?

Szklarczyk et al. Nucleic Acids Res. 2015 43(Database issue):D447-52

How can I trace the origin of the different types of evidence for an interaction?

This information is available if you click on an edge of the graph in the network view.

Which databases does STRING extract experimental data from?

BIND, DIP, GRID, HPRD, IntAct, MINT, and PID.

From which databases does STRING extract curated data?

Biocarta, BioCyc, GO, KEGG, and Reactome.

How do I extract purely experimental data?

Uncheck all boxes except "Experiments" in the "active interaction sources" box under the "Data Settings" tab.

I want to extract PPIs for a given species, but only from experimental data and not from evidence transferred from other species.

You need to sign the license agreement to download the file 'protein.links.full.txt.gz'. Use the file to get the direct experimental evidence, for example by grepping for the 'species_id' (e.g., 9606 for human) and printing the columns for protein1, protein2, and experiments (i.e., columns 1, 2, and 10).

zgrep ^"9606\." protein.links.full.txt.gz | awk '($10 != 0) { print $1, $2, $10 }' > direct_experimental_data_human.txt
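
If you prefer to do this filtering in a script, a rough Python equivalent of the command above could look like this. It assumes the file is whitespace-delimited with the experiments score in column 10, as in the awk command; the function names are made up for illustration.

```python
import gzip


def read_gzipped_lines(path):
    """Yield the lines of a gzipped text file one at a time."""
    with gzip.open(path, "rt") as handle:
        yield from handle


def direct_experimental(lines, species_id="9606", column=10):
    """Yield (protein1, protein2, score) where the given column is non-zero."""
    prefix = species_id + "."
    for line in lines:
        fields = line.split()
        # Skip the header, short lines, and lines from other species.
        if len(fields) < column or not fields[0].startswith(prefix):
            continue
        value = fields[column - 1]
        if value != "0":
            yield fields[0], fields[1], value


# Example (needs the downloaded file):
# for row in direct_experimental(read_gzipped_lines("protein.links.full.txt.gz")):
#     print(*row)
```

Because the file is read line by line, it never has to fit into memory.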

I want to differentiate physical interactions from functional ones within STRING.

In order to get the physical interactions, you need to download proteins.actions.(version).txt.gz from the download section. If the interaction is marked as "binding", you can be sure that it is a physical interaction. If the interaction does not have "binding" specified (i.e., anything else), it may be either physical or functional.

What do the columns in the proteins.actions file mean?

Here is a brief explanation of the column names for the action evidence.

  • item_id_a - identifier of protein A
  • item_id_b - identifier of protein B
  • mode - type of interaction (e.g. "reaction", "expression", "activation", "ptmod" (post-translational modifications), "binding", "catalysis")
  • action - the effect of the action ("inhibition", "activation")
  • a_is_acting - the directionality of the action if applicable (1 means that item_id_a is acting upon item_id_b)
  • score - the combined score of all interactions in STRING.
  • source - the source from which the inferred interaction is taken (bind, biocarta, biocyc, dip, grid, hprd, intact, kegg_pathways, mint, pdb, PID, reactome).
  • transferred_sources - sources used for transfer of evidence by homology/orthology from another species (if two proteins interact in several other species, it is plausible that they also interact in a closely related species where the interaction has not been observed).

If the column a_is_acting is 1 (TRUE), this means that item_id_a is acting on item_id_b. On the other hand, if it is 0 (FALSE), the opposite is not necessarily true. In this case the zero can indicate that the directionality of the interaction is not known or not applicable (e.g. binding).

The file is redundant. If the action goes in the other direction, this will be indicated on another line where the identifiers are swapped between the 1st and the 2nd column.
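
As an illustration, the "binding" rows could be collected in Python as sketched below. This assumes the file is tab-delimited and starts with a header row using the column names listed above; please check this against your downloaded version. The function name binding_pairs is made up for illustration.

```python
import csv


def binding_pairs(handle):
    """Return the set of protein pairs that have at least one "binding" row."""
    pairs = set()
    # Assumption: tab-delimited file with a header row naming the columns.
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["mode"] == "binding":
            # The file is redundant (A-B and B-A), so store each pair sorted.
            pairs.add(tuple(sorted((row["item_id_a"], row["item_id_b"]))))
    return pairs


# Example (needs the downloaded file; replace the file name with your version):
# import gzip
# with gzip.open("proteins.actions.txt.gz", "rt", newline="") as handle:
#     physical = binding_pairs(handle)
```

Sorting each pair before adding it to the set collapses the two redundant lines of an interaction into one entry.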

It is stated that STRING is locus-based and only a single translated protein per locus is stored. What does this mean?

STRING uses one protein per gene. If there is more than one isoform per gene, we usually select the longest isoform, unless we have information suggesting that another isoform is regarded as canonical (e.g., proteins in the CCDS database).

Does STRING contain any Gene Ontology information? We see that there is a table called funcats. What type of information does this contain?

The "funcats" table contains the functional categories as defined for the COG database. We import the GO complexes and use these for inferring interactions. GO terms themselves are planned for a future version.

Not directly, but searching for "wing" in Drosophila will return genes that have been annotated or described as such, each of which is associated with a network.

Does the database give a PubMed Reference ID for each interaction?

Interactions that have only predicted evidence do not have a PMID. Text-mining evidence may also stem from other sources, such as OMIM. Apart from these cases, interactions come with at least one PubMed reference ID. Some interactions have several different PMIDs and others share the same PMID (e.g., for external repositories, the interactions carry the PMID of the publication describing the database).

Are there different types of sets besides protein networks and pathways? What is the difference between a "set" and a "collection"?

The different types of sets are networks, pathways, complexes, and PDB structures with more than one protein. The "sets_items" are members of the evidence sets. An interaction exists if two lines have the same set_id. The "sets" contain information about the set_ids, for example, the "collection" from which they originate. The "collections" are the different data resources from which STRING imports data (for the channels 'experiments' and 'databases').

Can I access STRING using GI numbers? If so, could you use the 90 kD heat shock protein (GI:306891) as an example to show what I should type as the protein name when using an NCBI GI number?

The GI accession numbers are used to track sequence histories in GenBank. STRING does not use these numbers, nor does it keep track of them, mainly because STRING is locus-based. Also, STRING imports its sequences from Ensembl and RefSeq. If you need to cross-reference a particular entry in STRING from a GenBank record, you can use the accession id of the GenBank nucleotide record. For example, for the 90kDa heat shock protein in human this is M16660, which will give you this network.

Is there a legend or key for the different colored lines? (Is there a specific difference for each color?) (Is there a key for the colored lines in the evidence view?)

Yes. You can always find the legend for your view under the "Legend" tab below the network.

I assume the arrows mean activation and the red perpendicular lines mean repression, but what do the circles at the end of the line represent?

If the directionality of the action is known, it is indicated by a symbol at the end of the edge next to the protein that is acted upon. Down-regulation is a red bar and up-regulation is a green arrow, as you say. A yellow circle means that we know the directionality of the interaction (e.g., "A" acts upon "B"), but we do not know its result (e.g., whether it is up- or down-regulation).

At each node, there are icons inside the protein spheres. Is there a key for these icons? Do the icons represent the different protein functions (DNA binding, enzyme, etc.)?

The icons do not have any particular meaning other than that there is a structure associated with the protein. This can be either a PDB entry for the protein itself or for a close homolog. If no PDB entry exists, we check whether a structure is available by homology modeling from SWISS-MODEL. A small bubble (without icon) means that there is no structural information available. You can disable these structure previews in the "View Settings" tab.

I want to download the data for a particular network that I have found while browsing the STRING web interface.

You can download your network in the "Tables / Exports" tab below your network. You can choose to download your data in a number of formats. The simplest to use is probably "Text Summary (TXT - simple tab delimited flatfile)".

I need all the interactions for a particular organism.

You can download all data from the download section. At the bottom of the page there is a box where you can choose your organism of interest. For example, you can write "human" or "dog" there. When you click "update", all the files will automatically contain only the information about the taxon of your choice. STRING will also add the NCBI taxonomy identifier as a prefix to each file name.

Alternatively, you can download an unfiltered file (e.g., "protein.links.txt.gz") and filter it yourself using the NCBI taxonomy identifier of your organism of interest. You can find out here whether the organism you are looking for exists in STRING, along with its taxonomy identifier. Assuming you are using a Unix-based operating system (including Macs), you can filter the file like this (9606 is the taxonomy identifier of human):

zgrep "^9606\." protein.links.txt.gz > human.links.tsv

How to extract high confidence (>0.7) interactions using the "combined score" in "protein.links.txt.gz"?

Here you can simply use awk to filter on the third column, which contains the combined_score. Note that the scores are multiplied by 1000 to make them integers, so 0.7 corresponds to 700. I also assume that you only want evidence from human. Try the following:

zgrep "^9606\." protein.links.txt.gz | awk '($3 > 700) {print}'

How to retrieve only the direct evidence in human, not transferred?

You need the file "protein.links.full.txt.gz", from which you can retrieve the columns as above and write them to a file.

zgrep ^"9606\." protein.links.full.txt.gz  | awk '($16 > 700) { print $1, $2, $3, $5, $6, $7, $8, $10, $12, $14, $16 }' > PPI_700_human.txt

The first and the second columns contain the STRING external identifiers. The last column contains the integrated scores, including the homology-transferred evidence.

In the file "protein.links.txt", are the scores multiplied by 1000?

Yes, the scores are multiplied by a factor of 1000 (and truncated). For example, 872 in the file means a STRING score of 0.872.

Are the colors assigned to nodes significant?

The colored nodes are your input (in the case of multiple-protein input) or the first shell of interactors (in the case of single-protein input). Grey nodes are the proteins connected to your input (multiple-protein input) or the 2nd shell of interactors (single-protein input). There is no particular meaning to the node color itself. The colors are used as a visual aid to identify which node goes with which description in the list of input below the network ("Legend" tab) and in the evidence viewers.

What are the 1st and 2nd shell interactors?

The 1st shell interactors are the proteins directly associated with your input protein(s). The 2nd shell interactors are the proteins associated with the proteins from the 1st shell or with your input protein(s). It can happen that a 2nd shell protein is directly connected to your input protein(s), but it will usually have a weaker association and therefore does not show up among the specified number of 1st shell interactors. You can recognize which shell a protein belongs to by looking at the color of the bubble, as 2nd shell proteins are always grey.

Why are some nodes smaller and some nodes bigger?

The different size of a node only reflects that there is structural information associated with the protein (i.e., the node is larger to fit the thumbnail picture). You can disable the previews in the "View Settings" tab, which will render all bubbles in the same size.

How do I map my proteins to STRING identifiers?

You can use the file 'protein.aliases.txt', available from the download page. This file has four columns: species_ncbi_taxon_id, protein_id, alias, source. To figure out the STRING identifier for trpB in E. coli K12, you can do something like this in your terminal:

zgrep ^511145 protein.aliases.txt.gz | grep trpB

which would return:

511145  b1261   trpB    BLAST_UniProt_GN RefSeq

From this you can get the STRING name by concatenating the first two columns with a period (511145.b1261).
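
The lookup and the concatenation can also be scripted. The sketch below assumes the four columns are tab-separated and, unlike the grep above, requires an exact match on the alias column; the function name is made up for illustration.

```python
def string_ids_for_alias(lines, taxon_id, alias):
    """Yield STRING identifiers (taxon.protein_id) whose alias column matches."""
    for line in lines:
        # Assumption: columns are species, protein_id, alias, source (tab-separated).
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        species, protein_id, name = fields[0], fields[1], fields[2]
        if species == str(taxon_id) and name == alias:
            yield f"{species}.{protein_id}"


# Example (needs the downloaded file):
# import gzip
# with gzip.open("protein.aliases.txt.gz", "rt") as handle:
#     print(list(string_ids_for_alias(handle, 511145, "trpB")))
```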

Is there an automatic way of mapping proteins to STRING? I need mappings for more than three thousand proteins.

A convenient way of mapping your proteins to STRING entries is to use the STRING API. As an example, for a single protein, the alias can be retrieved by:

http://string-db.org/api/tsv/resolve?identifier=trpA&species=511145

Alternatively, instead of making one call per protein, you can resolve all the identifiers for a list of proteins in a single call (separated by the carriage return character, '%0D'):

http://string-db.org/api/tsv/resolveList?identifiers=trpA%0DtrpB&species=511145

In such cases you may run into the length limit of the URL, but this can be circumvented by sending the request as an HTTP POST request. For example, using cURL:

curl -d "identifiers=trpA%0DtrpC%0DtrpB%0DtrpD&species=511145" string-db.org/api/tsv/resolveList
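
The same POST request can be built from a script. The following is a sketch using only the Python standard library; the URL is the resolveList address shown above, and the function names are made up for illustration.

```python
import urllib.parse
import urllib.request

# Address as shown above; treat it as an assumption for your STRING version.
RESOLVE_LIST_URL = "http://string-db.org/api/tsv/resolveList"


def build_resolve_list_request(identifiers, species):
    """Build a POST request with the identifiers joined by carriage returns."""
    body = urllib.parse.urlencode(
        {"identifiers": "\r".join(identifiers), "species": species}
    )
    # Supplying data makes urllib send the request as POST instead of GET.
    return urllib.request.Request(RESOLVE_LIST_URL, data=body.encode("ascii"))


def resolve_list(identifiers, species):
    """Send the request and return the tab-separated response as text."""
    request = build_resolve_list_request(identifiers, species)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example (performs a real HTTP request):
# print(resolve_list(["trpA", "trpC", "trpB", "trpD"], 511145))
```

Here urlencode takes care of the percent-encoding, so the carriage return becomes %0D without having to write it by hand.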

I retrieve protein interactions from the STRING website via web API calls. What do the score columns mean (for example, nscore, fscore, tscore, etc.)?

Here is a summary.

  • nscore - neighborhood score, (computed from the inter-gene nucleotide count).
  • fscore - fusion score (derived from fused proteins in other species).
  • pscore - co-occurrence score of the phyletic profile (derived from similar absence/presence patterns of genes).
  • hscore - homology score, the degree of homology of the interactors (normally not reported in STRING).
  • ascore - coexpression score (derived from similar patterns of mRNA expression measured by DNA arrays and similar technologies).
  • escore - experimental score (derived from experimental data, such as affinity chromatography).
  • dscore - database score (derived from curated data of various databases).
  • tscore - textmining score (derived from the co-occurrence of gene/protein names in abstracts).

How do I select a reasonable score cut-off value for my analysis?

You can use the score cut-off to limit the number of interactions to those that have higher confidence and are more likely to be true positives. Setting the cut-off lower will increase coverage but also the fraction of false positives. You have to choose some arbitrary number based on the number of interactions you need for your analysis.

How do I import several interactions from STRING into Cytoscape?

Cytoscape supports the "tab separated values" file format. Download the "protein.links" file (from the STRING download page), extract the interactions you want (use grep or copy-paste), and load the processed file into Cytoscape.

You can link to a STRING network as follows:

http://string-db.org/newstring_cgi/show_network_section.pl?identifier=9606.ENSP00000234798

Mapping for the "identifier" parameter can be found in the alias files (http://string-db.org/newstring_download/protein.aliases.txt.gz), but conveniently, this is not necessary, since STRING will recognize other types of accession ids, for example swissprot ids:

    http://string-db.org/newstring_cgi/show_network_section.pl?identifier=Q9NRR2

You could even link by looking for the gene name and specifying the "species" parameter with the taxon id, but this is less stable.

    http://string-db.org/newstring_cgi/show_network_section.pl?identifier=TPSG1&species=9606

Now, if a STRING user has already specified some settings via the cookie, the network will have a different "look" for them. Unless you have a strong aversion to long URLs, I suggest that you explicitly specify in the link how you want your network to look by also applying the parameters "all_channels_on", "interactive", "network_flavor", and "targetmode".

    http://string-db.org/newstring_cgi/show_network_section.pl?identifier=Q9NRR2&all_channels_on=1&interactive=no&network_flavor=evidence&targetmode=proteins

If you want to generate a network preview, you can do this with a URL in an image tag. This is in fact an API call that generates the image on demand, which you can scale down to an appropriate size.

    <img src="http://string-db.org/api/image/network?targetmode=proteins&network_flavor=evidence&identifier=ENSP00000234798" height="200px">

Limitation on the number of proteins?

The web interface is not designed to handle a large number of proteins, and it is often difficult to visually interpret networks with a large number of nodes. In such cases, it is better to process your data using the download files.

Problems downloading large files?

To download files it is convenient to work in a terminal window. For example, the program "curl" with the option "-C -" is useful for downloading large files if you are on a shaky internet connection.

The downloaded file is really large. Which text editor should I use to view it?

It is better not to open the whole file in an editor but to extract the information you need from it. On Unix-based systems (Linux, Mac), the safest way to peek at and browse large files is to use cat/zcat (the latter is used with gzipped files) piped into the less command.

    zcat protein.links.v10.txt.gz | less -S
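
If you are working in Python rather than in a terminal, a similar peek at the first few lines can be done without loading the whole file. This is a small sketch; the function name peek_gzipped is made up for illustration.

```python
import gzip
from itertools import islice


def peek_gzipped(path, n=5):
    """Return the first n lines of a gzipped text file without reading the rest."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example (needs the downloaded file):
# for line in peek_gzipped("protein.links.v10.txt.gz"):
#     print(line)
```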