Christie Hunter


Peptide grouping in ProteinPilot software 5.0

A new feature in ProteinPilot software 5.0 is the ability to do peptide grouping. When analyzing proteomics data from LC-MS/MS, one must take into account the multiple pieces of evidence supporting a peptide as well as any ambiguity that might still exist in the peptide identification. Here, we think of a peptide group as a set of peptide hypotheses competing to explain the same physical peptide, so everything with identical molecular properties (sequence and modifications) or similar MW and RT is assembled into one peptide group. We group together all competing hypotheses for a given physical peptide stemming from three levels: multiple precursor charge states, multiple MS/MS spectra of a given charge state (within or between fractions), and multiple answers for a given MS/MS spectrum. Much like protein grouping, it is sometimes impossible to narrow the evidence down unambiguously to a single peptide, so a group of answers is reported to preserve the ambiguity observed.

ProteinPilot software assigns a peptide locus number to each distinct peptide group, where the integer part is the distinct peptide index and the thousandths position is the rank (for example, x.001 is the top-ranked member of group x). When working from the Distinct Peptide list, the winning peptide member for a group (locus x.001) is typically what you would use for downstream analysis. The Best Confidence (Peptide) is the maximum confidence across all peptides within the peptide group, so all members of the group are assigned the same Best Confidence (Peptide). The Best Hypothesis Confidence, or Best Confidence (Hypothesis), is the best confidence of all hypotheses in the group (for each row considering redundant IDs). Several factors go into deciding which hypothesis is the ‘winning peptide’ for the group (confidence, score, MS1 intensity, etc.), so while Best Confidence (Peptide) and Best Confidence (Hypothesis) are often the same for the winning member of the peptide group (x.001), this is not always the case (Example 2 below).
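The locus numbering described above can be split back into its two parts when post-processing an export. The following is a minimal sketch, assuming the locus is available as a decimal string with three decimal places (such as "12.003"); the function names `split_locus` and `is_winner` are hypothetical helpers and are not part of ProteinPilot.

```python
from decimal import Decimal
from typing import Tuple


def split_locus(locus: str) -> Tuple[int, int]:
    """Split a locus string into (distinct peptide index, rank).

    Assumes the 'index.rank' form with the rank in the thousandths
    position, e.g. '12.003' -> (12, 3). Decimal is used instead of
    float to avoid rounding problems with the fractional part.
    """
    value = Decimal(locus)
    index = int(value)                    # integer part: distinct peptide index
    rank = int((value - index) * 1000)    # thousandths position: rank in group
    return index, rank


def is_winner(locus: str) -> bool:
    """True for the winning member of a peptide group (locus x.001)."""
    return split_locus(locus)[1] == 1


print(split_locus("12.003"))
print(is_winner("7.001"))
```

Filtering an export on `is_winner` would keep one row per peptide group, which corresponds to the winning member that is typically used for downstream analysis.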

The “All N” column in the Distinct Peptide Export reflects the redundancy within the peptide group, where the same peptide can be found in multiple protein group winners (each N number represents a protein group with unique evidence). Because the peptide grouping algorithm groups together everything with approximately equal MW and RT, some hypotheses from different MS/MS spectra that are not associated with a declared protein will be placed in the peptide group but will have a blank in the “All N” column (Rows 40-41 in Example 1). Note that such hypotheses only show up as winners of a peptide group at the very end of the export, where we collect all the peptide groups whose winning hypotheses are not associated with a declared protein.

A final note: when handling multiple LC-MS data files, we only merge peptide hypotheses into a single peptide group if the same peptide sequence is identified from the MS/MS in different LC-MS/MS experiments. Proximity in MW or elution time is ignored across different samples.

Example 1: Distinct Peptide Export.

Example 2: Note that when working from the distilled distinct peptide export, you may not see all the hypotheses, but if you look in the raw results (i.e. in XML format) you will see everything. In this case, there are 2 acquisitions with the same hypothesis.

The top acquisition is chosen as representative due to higher precursor signal, so the confidence of the Best Hypothesis is 95.7. But since the overall maximum confidence is 99 (from the 2nd acquisition), this is displayed in Best Conf (Peptide).
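The difference between the two confidence columns in Example 2 can be illustrated with a small sketch. It deliberately reduces the choice of the representative hypothesis to precursor signal alone, which is only one of the aspects the software considers, so it is an illustration and not the actual selection algorithm. The confidences 95.7 and 99 come from the example; the precursor signal values, the `Hypothesis` class and the `summarize_group` function are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hypothesis:
    acquisition: str
    confidence: float        # percent, as reported in the export
    precursor_signal: float  # placeholder values, not taken from the export


def summarize_group(hyps: List[Hypothesis]) -> Dict[str, float]:
    # Simplifying assumption: the representative is the hypothesis with the
    # highest precursor signal. The real software weighs several factors.
    representative = max(hyps, key=lambda h: h.precursor_signal)
    return {
        # Confidence of the chosen representative hypothesis.
        "best_conf_hypothesis": representative.confidence,
        # Maximum confidence over every member of the peptide group.
        "best_conf_peptide": max(h.confidence for h in hyps),
    }


group = [
    Hypothesis("acquisition 1", 95.7, 2.0e5),  # higher signal, chosen as representative
    Hypothesis("acquisition 2", 99.0, 1.0e5),  # higher confidence, lower signal
]
result = summarize_group(group)
print(result)
```

With these inputs the representative confidence is 95.7 while the group maximum is 99, which is the situation shown in Example 2.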

The distinct peptide exports also report two further grouping concepts: sequence groups and sequence families.

A Sequence Group is a group of peptides that share identical base sequences, but with possibly different modifications. Because members of the same sequence group may have different molecular weights, they may not all belong to the same peptide group as described above. A Sequence Family is a group of peptides that map to the same region of the protein; however they do not need to have identical sequences to each other because the sequence family includes all identified sequences containing, or contained by, a particular peptide due to cleavage variations.

The ‘F Seq Signal’ and ‘F Seq Fam Signal’ columns in the distinct peptide export represent the fraction of the peptide intensity relative to the peptide intensities of the rest of the peptides in its sequence group or sequence family, respectively. The ‘Max Peptide In Seq Fam’ column contains a Boolean value that is true only if the peptide has the highest intensity relative to the rest of the members in its Sequence Family. This allows ready identification of the major digestion product detected from a region of a protein.

Example 3: Sequence Group and Sequence Family columns in the distinct peptide export.

A simple example of the calculations is presented here, in reference to the figure shown in Example 3.

There are 4 peptides listed:

  • The first 3 are identical base sequences but with alternative modification hypotheses.
  • The last peptide is a slightly different cleavage product, but maps to the same region of the protein (notice that its start position differs slightly from that of the other 3 peptides).

Therefore, the first 3 peptides belong to the same Sequence Group since they all have identical base sequences, and all 4 peptides belong to the same Sequence Family since they map to the same region of the protein.

The calculations can then be reproduced for the first peptide as follows:

  • F Seq Signal = Intensity / Sum (intensities in Sequence Group) = 8883.25 / (8883.25 + 10135.86 + 580.43) = 0.4532 (matches col AC)
  • F Seq Fam Signal = Intensity / Sum (intensities in Sequence Family) = 8883.25 / (8883.25 + 10135.86 + 580.43 + 2381.45) = 0.4041 (matches col AD)
  • Max Peptide In Seq Fam: the maximum intensity in the Sequence Family is max (intensities in Sequence Family) = 10135.86, which belongs to the second peptide. Col AE is therefore true for the second peptide and false for all other hypotheses, including the first peptide.

Note: the sum of the fractions within the same Sequence Group (column AC) or same Sequence Family (column AD) is equal to 1.
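The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using the four intensities quoted in the worked example; the variable and function names are illustrative and are not column names from the export.

```python
# Intensities of the four peptides in Example 3, in the order listed.
intensities = [8883.25, 10135.86, 580.43, 2381.45]

seq_group = intensities[:3]   # first 3 peptides share the same base sequence
seq_family = intensities      # all 4 map to the same region of the protein


def fractions(values):
    """Return each value as a fraction of the total of the list."""
    total = sum(values)
    return [v / total for v in values]


f_seq_signal = fractions(seq_group)        # corresponds to col AC
f_seq_fam_signal = fractions(seq_family)   # corresponds to col AD
# Corresponds to col AE: true only for the most intense family member.
max_in_seq_fam = [v == max(seq_family) for v in seq_family]

print(round(f_seq_signal[0], 4))       # 0.4532
print(round(f_seq_fam_signal[0], 4))   # 0.4041
print(max_in_seq_fam)                  # [False, True, False, False]
```

Summing `f_seq_signal` or `f_seq_fam_signal` gives 1 (up to floating-point rounding), in line with the note above.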

To reproduce these calculations, the distinct peptide export needs to be modified to export all hypotheses, because hypotheses that have a confidence less than 15% are excluded by default.

This is controlled by the following key in the ProteinPilot.exe.config file:

<add key="DistinctPeptConfThreshOnPeptExport" value="0.15"></add>

It can be reduced to a low value (e.g. 0.01) to have all hypotheses included in the export.


