This FAQ is intended to help researchers focus on their scientific discoveries and insights rather than on the complexities of data management.
Mass spectrometers generate data comprising a large collection of data points. All metabolomics researchers aim to transform these data points into identified metabolites and their quantities, which can then be translated into meaningful biological insights. Each acquired data point is multidimensional, typically consisting of m/z (mass-to-charge ratio), intensity, and retention time (RT), presented in profile or centroid form.
What is the difference between centroid and profile data?
Figure 1. Comparison of the different data forms for m/z 829.54 (a) profile data and (b) centroid data
Profile data offers the most comprehensive and detailed representation of a sample. Data is collected as a continuous signal across the m/z range, accurately depicting the shape of the peaks formed by detected ions. By containing all data points, profile data preserves the full complexity of the sample, including minor variations and noise, which are crucial for resolving closely eluting compounds and ensuring accurate quantitation. This allows researchers to perform in-depth analysis and apply sophisticated post-processing techniques, as all raw information is available. The trade-off is size: converted profile data is typically larger than centroid data, which makes processing and interpretation more complex and time-consuming.
Centroid data, on the other hand, is a more concise and simplified form of data. Each mass peak is represented by a single data point corresponding to the best estimate of the m/z value (the center of mass of the peak, where half the area of the peak lies on each side of this centroid m/z value), along with intensity and retention time (RT). By summarizing the information, centroid data significantly reduces the volume of data, making storage and computational analysis more efficient. However, this reduction involves some loss of detailed information, such as the exact shape and width of the peaks, which can obscure minor variations and noise present in the profile data. Related information that can be calculated from the shape of the peak, such as mass resolution, is also lost.
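As a rough illustration of the "center of mass" idea, the sketch below reduces one profile peak to a single centroid point using an intensity-weighted mean m/z. This is a generic textbook calculation, not the SCIEX peak finder; the function name and the data points are hypothetical.

```python
def centroid_mz(mz, intensity):
    """Return the intensity-weighted mean m/z (center of mass) of one profile peak."""
    total = sum(intensity)
    if total <= 0:
        raise ValueError("peak has no signal")
    return sum(m * i for m, i in zip(mz, intensity)) / total


# Hypothetical profile points for a single, symmetric peak near m/z 829.54
mz = [829.52, 829.53, 829.54, 829.55, 829.56]
intensity = [100.0, 600.0, 1000.0, 600.0, 100.0]

peak_mz = centroid_mz(mz, intensity)
peak_height = max(intensity)

# Five profile points collapse into one (m/z, intensity) pair;
# the peak shape and width are no longer recoverable from it.
print(f"centroid m/z = {peak_mz:.4f}, height = {peak_height:.0f}")
```

For a symmetric peak the weighted mean coincides with the apex; for asymmetric or overlapping peaks the two can differ, which is one reason the choice of centroiding algorithm matters.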
When processing metabolomics data, it is important to consider how it has been collected and what processing software or downstream processing you plan to apply. SCIEX instruments collect data in profile mode, and although our software can process this data, most third-party software requires data to be centroided.
My SCIEX instrument generates both .wiff and .wiff2; what is the difference?
Several files are saved during acquisition. The .scan file contains the actual raw spectral data. Whether you open the data from .wiff or .wiff2 files, they both access the same .scan file. The .wiff and .wiff2 files store metadata such as sample name and acquisition date, but no spectral data is stored in these files. The older .wiff file is maintained primarily for compatibility with third-party software that has not yet been updated to use the new data API (data Application Programming Interface). The newer .wiff2 file is regularly updated as technology and acquisition modes advance to include all metadata to support your data analysis. It is, therefore, almost always advised to use the newest file format, .wiff2.
How do I import my SCIEX data if I want to use third-party software to process my metabolomics data?
The software you use will determine the approach you take. Many vendors collaborate closely with software developers and provide access to their data APIs, enabling direct reading of their native data formats. If you use third-party software with access to a data API reader, you can directly import .wiff or .wiff2 files, depending on the available API version. Third-party software with access to vendor APIs will use the vendor’s peak finder and, therefore, the optimized peak picking/centroiding algorithms. However, the third-party software can still apply its own processing steps afterward, which could lead to variations in results between different software tools.
In cases where no API is available, vendor-specific raw mass spectrometry data formats must be converted into an open format to facilitate analysis using different software tools. One example of such a format is mzML, specifically designed to store and exchange mass spectrometry data. It is widely used in proteomics and metabolomics for its flexibility and compatibility with various software tools and platforms. Converted data can be stored in either profile or centroid format. However, profile data must be converted and centroided to utilize many third-party software tools fully.
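Whether a converted file holds profile or centroid spectra can be checked from the mzML itself. The sketch below uses only the Python standard library and assumes the PSI-MS controlled vocabulary terms MS:1000127 (centroid spectrum) and MS:1000128 (profile spectrum) are attached directly to each spectrum element. The sample document is a trimmed, hypothetical fragment rather than a schema-valid file, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

CENTROID = "MS:1000127"  # PSI-MS term "centroid spectrum"
PROFILE = "MS:1000128"   # PSI-MS term "profile spectrum"


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}spectrum' -> 'spectrum'."""
    return tag.rsplit("}", 1)[-1]


def spectrum_modes(mzml_text):
    """Map each spectrum id to 'centroid', 'profile', or 'unknown'."""
    root = ET.fromstring(mzml_text)
    modes = {}
    for elem in root.iter():
        if local_name(elem.tag) != "spectrum":
            continue
        # Only cvParams that are direct children of <spectrum> are inspected
        accessions = {
            child.get("accession")
            for child in elem
            if local_name(child.tag) == "cvParam"
        }
        if CENTROID in accessions:
            modes[elem.get("id")] = "centroid"
        elif PROFILE in accessions:
            modes[elem.get("id")] = "profile"
        else:
            modes[elem.get("id")] = "unknown"
    return modes


# Hypothetical, heavily trimmed mzML fragment for illustration only
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo">
    <spectrumList count="2">
      <spectrum id="scan=1" index="0" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000128" name="profile spectrum"/>
      </spectrum>
      <spectrum id="scan=2" index="1" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""

modes = spectrum_modes(SAMPLE)
print(modes)
```

Some writers place these terms in a shared parameter group instead of on each spectrum, in which case this simple check reports "unknown". For large files, a streaming parser would be a better fit than loading the whole document.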
How do I centroid and convert my profile data into mzML for processing?
Tools like msconvert in ProteoWizard [1] and the R package MSnbase [2] offer centroiding for raw and/or converted data. You should use the vendor settings in msconvert and follow the instructions listed in this document, as a shift of several ppm can occur with incorrect conversion parameters.
Important: Do not use the SCIEX data converter for metabolomics data. This converter is optimized for proteomics data and should only be used with IDA data for peptides, as it uses peak-picking algorithms from Protein Pilot software.
You can download the latest version of ProteoWizard to convert and centroid your profile data here: ProteoWizard Download.
For the best conversion, it is crucial to make the ‘peakPicking’ filter the first filter in the list. Otherwise, the ProteoWizard software will ignore the vendor peak picking, leading to incorrect mass assignment for most peaks.
NOTE: The data file(s) to be processed must not be open. For example, if using Explorer in SCIEX OS software, close any file(s) you intend to process; otherwise, ProteoWizard will report a “Failed” status.
peakPicking [<PickerType> [snr=<minimum signal-to-noise ratio>] [peakSpace=<minimum peak spacing>] [msLevel=<ms_levels>]]
This filter performs centroiding on spectra with the selected <ms_levels>, expressed as an int_set. The value for <PickerType> must be “cwt” or “vendor”:
PickerType = “vendor”: Uses vendor (Windows dll) code if available.
PickerType = “cwt”: Uses ProteoWizard’s wavelet-based algorithm for peak-picking with a wavelet-space signal-to-noise ratio of <minimum signal-to-noise ratio>.
Important: Peak picking should be performed in ProteoWizard using the vendor-based algorithm. The “Vendor.dll” is the set of files the vendor provided to ProteoWizard to read .wiff and .wiff2 files. These are the same files used in SCIEX OS software for reading data in PeakView software/Explorer.
NOTE: Since this filter operates on the raw data through the vendor files, IT MUST BE THE FIRST FILTER IN ANY LIST OF FILTERS when “vendor” is used.
- Remove all filters (such as titleMaker) that might already be in the processing window by using the Remove button.
- Select the Output format
- Add the Peak Picking filter. As noted above, ALWAYS add Peak Picking as the first filter. No other filter is necessary to generate the peak list.
- After clicking Save Presets, click the Start button to begin processing.
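The same conversion can also be scripted with the msconvert command-line tool that ships with ProteoWizard. The sketch below only builds the argument list and keeps vendor peak picking as the first filter; the file name, output folder, and helper names are hypothetical, and `msLevel=1-` (all MS levels) is an assumed choice rather than a setting stated in this document.

```python
import shlex
import shutil
import subprocess


def build_msconvert_command(input_file, output_dir, extra_filters=()):
    """Build an msconvert argument list with vendor peak picking as the FIRST filter."""
    filters = ["peakPicking vendor msLevel=1-"] + list(extra_filters)
    command = ["msconvert", input_file, "--mzML", "-o", output_dir]
    for filter_spec in filters:
        command += ["--filter", filter_spec]
    return command


def convert(input_file, output_dir):
    """Run msconvert if it is on PATH; the input file must not be open elsewhere."""
    if shutil.which("msconvert") is None:
        raise RuntimeError("msconvert was not found on PATH")
    subprocess.run(build_msconvert_command(input_file, output_dir), check=True)


# Hypothetical file and folder names, shown for illustration only
command = build_msconvert_command("sample01.wiff2", "converted")
print(shlex.join(command))
```

Calling `convert("sample01.wiff2", "converted")` would run the conversion on a machine where ProteoWizard and the vendor libraries are installed. Any additional filters are appended after peak picking, so the ordering rule above cannot be broken by accident.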
You can also view centroid data in PeakView and SCIEX OS software; there are several options as to which metric you use for the peak picking: intensity, height, area and intensity sum above 50% (Figure 2).
Figure 2. Centroid options in SCIEX OS software.
NOTE: If you use different metrics to centroid the data, there will be differences in the results reported between each method (Figure 3).
How and why do the results vary with the different centroiding metrics?
Figure 3. Comparison of (a) profile data to the different methods of centroiding: (b) intensity, (c) height, (d) sum intensity above 50% peak height and (e) area (measured in counts per second (cps) times Da; cps∙Da)
There are several ways to centroid your data, and the metric you choose can drastically alter your results (Figure 3). First, consider that TOF instruments operate at a constant resolution, which means that peak width changes with the m/z value. High m/z peaks tend to be wider, giving a larger area for the same peak height, whereas low m/z peaks are usually narrower and have a smaller area for the same peak height (Figure 4). Examining the raw spectra can therefore give the illusion that one feature or fragment is more abundant than another. Intensity or height is often used in data processing, but height should be avoided when analyzing TOF data. Height-centroided data also has the most variability due to noise and generates the worst results for quantitation, as every quantitative measure comes from a single data point. When area or sum intensity is considered, both of which are much harder to visualize, the true quantity of each peak becomes more apparent. SCIEX peak finding either sums the intensity of each data point within the top 50% of the peak height (units of cps) or calculates the filled-in area of the peak (units of cps·Da). Both methods scale with the amount of signal.
Figure 4. Comparison of profile and 3 different metrics used to centroid TOFMS data. The low m/z peaks, such as 132 m/z, are usually narrower and have a smaller area for the same peak height compared to the high m/z peaks, for example, 829 m/z, which are wider, leading to a larger area for the same peak height.
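The effect of constant resolution on the different metrics can be reproduced with a small simulation. The sketch below is a simplified model, not the SCIEX peak finder: it assumes Gaussian peak shapes, a hypothetical resolution of 30,000 (FWHM = m/z ÷ resolution), and a fixed m/z spacing between data points. Two peaks with the same height are generated at m/z 132 and m/z 829, and each metric is computed for both.

```python
import math

RESOLUTION = 30000.0  # assumed constant resolution, R = m/z / FWHM
STEP = 0.0005         # assumed spacing between data points, in Da
HEIGHT = 1000.0       # same apex intensity (cps) for both peaks


def gaussian_peak(center, fwhm, height, step):
    """Sample a Gaussian profile peak on a regular m/z grid spanning +/- 3 FWHM."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    n = int(round(3.0 * fwhm / step))
    mz = [center + k * step for k in range(-n, n + 1)]
    inten = [height * math.exp(-0.5 * ((m - center) / sigma) ** 2) for m in mz]
    return mz, inten


def centroid_metrics(mz, inten):
    """Compute height (cps), trapezoidal area (cps*Da), and sum above 50% height (cps)."""
    height = max(inten)
    area = sum(
        0.5 * (inten[i] + inten[i + 1]) * (mz[i + 1] - mz[i])
        for i in range(len(mz) - 1)
    )
    sum_above_50 = sum(value for value in inten if value >= 0.5 * height)
    return {"height": height, "area": area, "sum_above_50": sum_above_50}


results = {}
for center in (132.0, 829.0):
    fwhm = center / RESOLUTION
    results[center] = centroid_metrics(*gaussian_peak(center, fwhm, HEIGHT, STEP))
    r = results[center]
    print(f"m/z {center:.0f}: height={r['height']:.0f} cps, "
          f"area={r['area']:.2f} cps*Da, sum>50%={r['sum_above_50']:.0f} cps")
```

Under these assumptions both peaks report the same height, while the area and the sum above 50% are several times larger for the m/z 829 peak because it is wider. Real TOF data point spacing is not constant in m/z, so the exact ratios on an instrument will differ.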
Why do I get different results when I convert .wiff and .wiff2?
Remember, there is no spectral data in these files. Currently, .wiff files can only be read using an older Wiff Reader SDK, and the only centroid value it can measure is the area of the mass peak. In contrast, .wiff2 files can only be read by the new Data API, and it measures Sum Intensity Above 50%. You can see these different centroid values in SCIEX OS Explorer when you display the Data and Peaks Table for a spectrum.
As mentioned previously, the peak finder used with .wiff files will use Area, and .wiff2 will use the Sum Intensity Above 50%; both scale with the amount of signal. This also comes with a new challenge: the units. When using the .wiff file area in units of cps∙Da, these values can be very small (less than 1.0). Many open-source software tools fail to read values below 1, and they often have default thresholds of 100 as they assume this value is in cps, not area. To overcome this issue, the cps·Da values from older .wiff files must be multiplied by 100.
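The rescaling step is simple arithmetic and can be applied to the intensity values after conversion. The sketch below uses hypothetical cps·Da values and a hypothetical downstream threshold of 100 to show why the factor of 100 matters; the function name is illustrative.

```python
def rescale_area_intensities(intensities, factor=100.0):
    """Multiply cps*Da area values by a constant so downstream thresholds behave as expected."""
    return [value * factor for value in intensities]


# Hypothetical centroid intensities from a .wiff conversion, in cps*Da
areas_cps_da = [0.42, 0.87, 3.5, 12.0]
THRESHOLD = 100.0  # example default threshold in a third-party tool

scaled = rescale_area_intensities(areas_cps_da)
kept_before = sum(1 for value in areas_cps_da if value >= THRESHOLD)
kept_after = sum(1 for value in scaled if value >= THRESHOLD)

print(f"peaks above threshold before scaling: {kept_before}, after scaling: {kept_after}")
```

Because every value is multiplied by the same constant, relative abundances within and between samples are unchanged, provided the same factor is applied to all files in a study.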
The peak finder used with .wiff2 files does not have this same issue, as it sums the cps at each data point above 50% peak height. This generally provides better CVs, as the more variable tails of the peak are excluded. However, it can be an issue for peaks with a low number of data points, since the data points below 50% of the peak height are excluded. This centroid data gives a true representation of every ion that hits the detector, which is necessary for quantitation.
Does it matter which data type you convert: DDA, SWATH-DIA, or MRM?
The converters will convert each MS/MS spectrum, regardless of whether the MS/MS was acquired by DDA, SWATH-DIA or MRMHR. The description of a DDA spectrum will include the candidate’s precursor m/z, while a SWATH MS/MS spectrum will have values for the isolation window start and stop. If the data is MRM from a nominal mass instrument, the converter handles it as a chromatogram rather than as individual spectra.
Can I convert my EAD data now using the same converter?
The ProteoWizard converter has recently been updated to read the fragmentation mode (EAD vs. CID) and the electron kinetic energy. You can convert your .wiff2 file using the instructions above.
Vendors and developers constantly try to improve data conversion and integrate new workflows, so always use the latest version of ProteoWizard. Please get in touch with the developers if you find any bugs or have any suggestions.
References
1. A cross-platform toolkit for mass spectrometry and proteomics. Chambers, M.C., et al. Nature Biotechnology 30, 918-920 (2012).
2. MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data. Gatto, L., et al. J Proteome Res 20, 1063-1069 (2021).