Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data

Revision as of 16:41, 28 February 2020 by Ckreutz

1 Citation

Baolin Wu, Tom Abbott, David Fishman, Walter McMurray, Gil Mor, Kathryn Stone, David Ward, Kenneth Williams, Hongyu Zhao, Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data, 2003, Bioinformatics, 19(13), 1636–1643.

Permanent link to the paper

2 Summary

The following classification methods were assessed in the context of mass spectrometry (MS) data:

  • linear discriminant analysis
  • quadratic discriminant analysis
  • k-nearest neighbor classifier
  • bagging and boosting classification trees
  • support vector machine
  • random forest (RF)

The aim was to classify serum samples as ovarian cancer or control. Classification performance was assessed by crossvalidation.

3 Study outcomes

Prediction errors were around 10-20%.

3.1 Outcome O1

  • Overall, the methods perform better when RF is used for feature selection (with few exceptions)
  • Overall, the methods perform better when 25 features rather than 15 are used for classification (with few exceptions)
  • Some approaches show a large variance in performance (Bagging, ARC), whereas others show a rather small variance (NN, SVM)
  • The .632+ estimator yielded slightly lower error estimates (i.e. better apparent performance) and smaller variances than 10-fold CV
  • Overall, RF had the best performance when feature selection was performed using RF

Outcome O1 is presented in Figures 4 and 5 of the original publication.
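
For readers unfamiliar with the .632+ estimator referred to in this outcome, the following is a minimal sketch of one common formulation (following Efron and Tibshirani). It blends the resubstitution error with the leave-one-out bootstrap error, using a weight that depends on the relative overfitting rate. The function and argument names are illustrative assumptions, not taken from the paper, and the clipping of the bootstrap error to the no-information rate is one way of keeping the weight in a valid range.

```python
def no_information_rate(y_true, y_pred):
    """No-information error rate (gamma) for a two-class problem with labels 0/1.

    This is the error expected if predictions were independent of the true labels.
    """
    n = len(y_true)
    p1 = sum(1 for y in y_true if y == 1) / n  # observed fraction of class 1
    q1 = sum(1 for y in y_pred if y == 1) / n  # predicted fraction of class 1
    return p1 * (1 - q1) + (1 - p1) * q1


def err632plus(resub_err, loo_boot_err, gamma):
    """Sketch of the .632+ error estimate.

    resub_err    -- error on the training data itself (resubstitution error)
    loo_boot_err -- leave-one-out bootstrap error (samples not in the bootstrap draw)
    gamma        -- no-information error rate
    """
    # Assumption: clip the bootstrap error so that it cannot exceed gamma.
    err1 = min(loo_boot_err, gamma)
    # Relative overfitting rate, between 0 (no overfitting) and 1.
    if err1 > resub_err and gamma > resub_err:
        r = (err1 - resub_err) / (gamma - resub_err)
    else:
        r = 0.0
    # The weight moves from 0.632 (plain .632 estimator) towards 1 as overfitting grows.
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * resub_err + w * err1
```

With no overfitting the weight is 0.632 and the result equals the plain .632 estimate. If the bootstrap error reaches the no-information rate, the weight becomes 1 and the estimate is the bootstrap error itself.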

4 Study design and evidence level

4.1 General aspects

A single data set containing measurements from 47 patients with ovarian cancer and from 44 normal patients was used for the analysis.

Two methods were used to estimate the prediction error:

  • 10-fold crossvalidation
  • Bootstrap with the .632+ estimator

For classification, either 15 or 25 features were selected, using either t-statistics or random forests (RF).
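
As an illustration of the t-statistic based feature selection mentioned above, the following sketch ranks features by the absolute two-sample t-statistic between the two classes and keeps the top k. It is not the authors' implementation. The Welch (unequal variance) form of the statistic, the function names and the 0/1 label coding are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t-statistic with unequal variances (Welch form, an assumption here)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_features(X, y, k):
    """Return the indices of the k features with the largest |t| between classes.

    X -- list of samples, each a list of feature values (e.g. peak intensities)
    y -- list of class labels, 1 for cancer and 0 for control (hypothetical coding)
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        cancer = [row[j] for row, label in zip(X, y) if label == 1]
        control = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(cancer, control)))
    # Sort feature indices by decreasing score and keep the first k.
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

In the setting described here, k would be 15 or 25. To avoid optimistic error estimates, such a selection step is usually repeated inside each crossvalidation fold or bootstrap sample, rather than performed once on the full data set.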

5 Further comments and aspects

The data were generated at a time when MS data quality and data processing were still at an early stage of development. Processing tools for raw spectra such as MaxQuant and OpenMS were not yet available.

6 References