[4] Proteomics Identification Software: MSGFPlus

1. Introduction

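MS-GF+ is another protein identification search engine that has been widely used in recent years. It is written in Java, was first published in the Journal of Proteome Research (JPR) in 2008, and an upgraded version was published in Nature Communications (NC) in 2014. It is free, open source, and actively maintained. Researchers have also compared different proteomics identification tools and found that MS-GF+ performs very well (I cannot locate the reference at the moment).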

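GitHub source code: https://github.com/MSGFPlus/msgfplus
Supported input formats include: mzML, mzXML, Mascot Generic File (mgf), MS2 files, Micromass Peak List files (pkl), Concatenated DTA files (_dta.txt)
It primarily supports the HUPO PSI standard formats: mzML for input and mzIdentML (abbreviated mzid) for output, which is easy to convert to TSV.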
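
As a quick pre-flight check, the input formats listed above can be recognized by file suffix. The Python sketch below is only an illustration: the helper name is hypothetical, and the suffix list is inferred from the format names above (MS-GF+ itself decides how a file is read).

```python
# Hypothetical helper, not part of MS-GF+: check a spectrum file name
# against the suffixes implied by the supported-format list above.
SUPPORTED_SUFFIXES = (".mzml", ".mzxml", ".mgf", ".ms2", ".pkl", "_dta.txt")

def looks_supported(filename: str) -> bool:
    """Return True if the name ends with one of the listed suffixes (case-insensitive)."""
    return filename.lower().endswith(SUPPORTED_SUFFIXES)

for name in ("test.mzML", "sample_dta.txt", "test.raw"):
    print(name, looks_supported(name))
```

Files in other formats would need to be converted to one of the listed formats before searching.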

For details on the mzIdentML format, see http://www.psidev.info/mzidentml

2. Installation and usage

Download: https://github.com/MSGFPlus/msgfplus/releases

For usage, MS-GF+ has very detailed documentation: MS-GF+ Documentation (https://msgfplus.github.io/msgfplus/index.html)

Parameter configuration files:
https://github.com/MSGFPlus/msgfplus/tree/master/docs/ParameterFiles

For running a search, the documentation provides many examples along with explanations of the parameters:
https://msgfplus.github.io/msgfplus/MSGFPlus.html

Example run 1:

java -Xmx4000M -jar MSGFPlus.jar \
  -s test.mzML \
  -d uniprot_swissprot_human_20190313_20417.fasta \
  -t 20ppm -ti -1,2 -ntt 0 -tda 1 -e 0 -m 3 -inst 3 -minCharge 1 -maxCharge 6 -addFeatures 1 \
  -mod Mods.txt \
  -o test.mzid
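
The `-t 20ppm -ti -1,2` options in the command above work together: according to the MS-GF+ parameter notes, the pair tests abs(exp - calc - n*1.00335 Da) < 20 ppm for n = -1, 0, 1, 2. A minimal Python sketch of that check follows. The function name and the choice to express ppm relative to the calculated mass are assumptions made for illustration, not MS-GF+ internals.

```python
# Mass shift per isotope step, as quoted in the MS-GF+ parameter notes.
ISOTOPE_SPACING_DA = 1.00335

def matching_isotope_errors(exp_mass, calc_mass, tol_ppm=20.0, ti_range=(-1, 2)):
    """Return the isotope offsets n for which |exp - calc - n*1.00335| is within tol_ppm."""
    hits = []
    for n in range(ti_range[0], ti_range[1] + 1):
        delta = abs(exp_mass - calc_mass - n * ISOTOPE_SPACING_DA)
        # Assumption: the ppm tolerance is taken relative to the calculated mass.
        if delta / calc_mass * 1e6 < tol_ppm:
            hits.append(n)
    return hits

# A precursor picked one isotope peak above the monoisotopic peak.
print(matching_isotope_errors(1501.0050, 1500.0000))
```

An empty list means the candidate peptide would be rejected under this tolerance. Widening the `-ti` range admits more candidates per spectrum, which enlarges the search space.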

The modification file Mods.txt contains the following:

# This file is used to specify modifications
# # for comments
#
# Max Number of Modifications per peptide
# If this value is large, the search takes long.
NumMods=2

# To input a modification, use the following command:
# Mass or CompositionStr, Residues, ModType, Position, Name (all the five fields are required).
# CompositionStr (C[Num]H[Num]N[Num]O[Num]S[Num]P[Num]Br[Num]Cl[Num]Fe[Num])
#       - C (Carbon), H (Hydrogen), N (Nitrogen), O (Oxygen), S (Sulfur), P (Phosphorus), Br (Bromine), Cl (Chlorine), Fe (Iron), and Se (Selenium) are allowed.
#       - Negative numbers are allowed.
#       - E.g. C2H2O1 (valid), H2C1O1 (invalid)
# Mass can be used instead of CompositionStr. It is important to specify accurate masses (integer masses are insufficient).
#       - E.g. 15.994915
# Residues: affected amino acids (must be uppercase letters)
#       - Must be uppercase letters or *
#       - Use * if this modification is applicable to any residue.
#       - * should not be "anywhere" modification (e.g. "15.994915, *, opt, any, Oxidation" is not allowed.)
#       - E.g. NQ, *
# ModType: "fix" for fixed modifications, "opt" for variable modifications (case insensitive)
# Position: position in the peptide where the modification can be attached.
#       - One of the following five values should be used:
#       - any (anywhere), N-term (peptide N-term), C-term (peptide C-term), Prot-N-term (protein N-term), Prot-C-term (protein C-term)
#       - Case insensitive
#       - "-" can be omitted
#       - E.g. any, Any, Prot-n-Term, ProtNTerm => all valid
# Name: name of the modification (Unimod PSI-MS name)
#       - For proper mzIdentML output, this name should be the same as the Unimod PSI-MS name
#       - E.g. Phospho, Acetyl
#       - Visit http://www.unimod.org to get PSI-MS names.

C2H3N1O1,C,fix,any,Carbamidomethyl              # Fixed Carbamidomethyl C
#144.102063,*,fix,N-term,iTRAQ4plex             # iTRAQ 4 plex
#144.102063,K,fix,any,iTRAQ4plex                        # iTRAQ 4 plex

# Variable Modifications (default: none)
O1,M,opt,any,Oxidation                          # Oxidation M
#15.994915,M,opt,any,Oxidation                  # Oxidation M (mass is used instead of CompositionStr)
H-1N-1O1,NQ,opt,any,Deamidated                  # Negative numbers are allowed.
#C2H3NO,*,opt,N-term,Carbamidomethyl            # Variable Carbamidomethyl N-term
#H-2O-1,E,opt,N-term,Glu->pyro-Glu                      # Pyro-glu from E
#H-3N-1,Q,opt,N-term,Gln->pyro-Glu                      # Pyro-glu from Q
#C2H2O,*,opt,Prot-N-term,Acetyl                 # Acetylation Protein N-term
#C2H2O1,K,opt,any,Acetyl                        # Acetylation K
#CH2,K,opt,any,Methyl                           # Methylation K
#HO3P,STY,opt,any,Phospho                       # Phosphorylation STY
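
To make the five-field format concrete, here is a minimal Python sketch that parses one modification line and converts a CompositionStr to a monoisotopic mass. The function names and the small element table are assumptions for illustration (Br, Cl, Fe, and Se are left out, so such compositions raise a KeyError here). This is not how MS-GF+ itself reads the file.

```python
import re

# Monoisotopic masses (Da) for the elements used in the examples above.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
        "O": 15.99491462, "S": 31.97207100, "P": 30.97376163}

def composition_mass(comp: str) -> float:
    """Sum monoisotopic masses for a string such as 'H-1N-1O1' (a missing count means 1)."""
    total = 0.0
    for elem, count in re.findall(r"([A-Z][a-z]?)(-?\d+)?", comp):
        total += MONO[elem] * (int(count) if count else 1)
    return total

def parse_mod_line(line: str):
    """Parse 'Mass or CompositionStr, Residues, ModType, Position, Name'.

    Returns None for blank lines, comment lines, and the NumMods setting.
    """
    body = line.split("#", 1)[0].strip()
    if not body or body.startswith("NumMods"):
        return None
    spec, residues, mod_type, position, name = [f.strip() for f in body.split(",")]
    try:
        mass = float(spec)              # a numeric mass was given directly
    except ValueError:
        mass = composition_mass(spec)   # otherwise treat it as a composition
    return {"mass": round(mass, 6), "residues": residues,
            "type": mod_type.lower(), "position": position, "name": name}

print(parse_mod_line("C2H3N1O1,C,fix,any,Carbamidomethyl   # Fixed Carbamidomethyl C"))
print(parse_mod_line("H-1N-1O1,NQ,opt,any,Deamidated"))
```

Running a Mods.txt through a parser like this before a long search is a cheap way to catch a line with a missing field, since unpacking five fields fails loudly on malformed input.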

Example run 2:

java -Xmx4g -Xms1g -jar MSGFPlus.jar \
-conf MSGFPlus_Parameters.txt \
-d test.fasta \
-s test.mzML \
-o test.mzid
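
When the search is launched from a script, passing the arguments as a list sidesteps shell line-continuation and quoting pitfalls. The sketch below assembles the same command as example 2. The function name and its defaults are hypothetical, and the call that would actually start Java is left commented out.

```python
import shlex

def build_msgfplus_command(spectrum, fasta, conf, output,
                           jar="MSGFPlus.jar", xmx="4g", xms="1g"):
    """Assemble the example-2 java command as an argument list (no shell involved)."""
    return ["java", f"-Xmx{xmx}", f"-Xms{xms}", "-jar", jar,
            "-conf", conf, "-d", fasta, "-s", spectrum, "-o", output]

cmd = build_msgfplus_command("test.mzML", "test.fasta",
                             "MSGFPlus_Parameters.txt", "test.mzid")
print(shlex.join(cmd))

# To run the search (requires Java and MSGFPlus.jar in the working directory):
# import subprocess
# subprocess.run(cmd, check=True)
```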

The parameter configuration file MSGFPlus_Parameters.txt contains the following:

#Parent mass tolerance
#  Examples: 2.5Da or 30ppm
#  Use comma to set asymmetric values, for example "0.5Da,2.5Da" will set 0.5Da to the left (expMass<theoMass) and 2.5Da to the right (expMass>theoMass)
PrecursorMassTolerance=20ppm

#Max Number of Modifications per peptide
# If this value is large, the search will be slow
NumMods=5

#Modifications (see below for examples)
StaticMod=C2H3N1O1,  C,   fix,  any,  Carbamidomethyl              # Fixed Carbamidomethyl C
DynamicMod=O1,       M,   opt,  any,  Oxidation                    # Oxidized methionine
DynamicMod=H-1N-1O1, NQ,  opt,  any,  Deamidated                   # Deamidation of Glutamine (+0.984016)

#Custom amino acids
CustomAA=C3H5NO,     U,  custom, U,   Selenocysteine               # Custom amino acids can only have C, H, N, O, and S
#CustomAA=H0,        X,  custom, X,   RemoveAA                     # Remove AA

#Fragmentation Method
#  0 means as written in the spectrum or CID if no info (Default)
#  1 means CID
#  2 means ETD
#  3 means HCD
#  4 means Merge spectra from the same precursor (e.g. CID/ETD pairs, CID/HCD/ETD triplets)
FragmentationMethodID=3

#Instrument ID
#  0 means Low-res LCQ/LTQ (Default for CID and ETD); use InstrumentID=0 if analyzing a dataset with low-res CID and high-res HCD spectra
#  1 means High-res LTQ (Default for HCD; also appropriate for high res CID); use InstrumentID=1 for Orbitrap, Lumos, and QEHFX instruments
#  2 means TOF
#  3 means Q-Exactive
InstrumentID=1

#Enzyme ID
#  0 means No enzyme used
#  1 means Trypsin (Default); use this along with NTT=0 for a no-enzyme search of a tryptically digested sample
#  2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: No Enzyme (for peptidomics)
EnzymeID=1

#Isotope error range
#  Takes into account the error introduced by choosing a non-monoisotopic peak for fragmentation.
#  Useful for accurate precursor ion masses
#  Ignored if the parent mass tolerance is > 0.5Da or 500ppm
#  The combination of -t and -ti determines the precursor mass tolerance.
#  e.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
IsotopeErrorRange=0,3

#Number of tolerable termini
#  The number of peptide termini that must have been cleaved by the enzyme (default 1)
#  For trypsin, 2 means fully tryptic only, 1 means partially tryptic, and 0 means no-enzyme search
NTT=2

#Target/Decoy search mode
#  0 means don't search decoy database (default)
#  1 means search decoy database to compute FDR (source FASTA file must be forward-only proteins)
TDA=1

#Number of Threads (by default, uses all available cores)
NumThreads=8

#Minimum peptide length to consider
MinPepLength=6

#Maximum peptide length to consider
MaxPepLength=50

#Minimum precursor charge to consider (if not specified in the spectrum)
MinCharge=1

#Maximum precursor charge to consider (if not specified in the spectrum)
MaxCharge=6

#Number of matches per spectrum to be reported
#If this value is greater than 1 then the FDR values computed by MS-GF+ will be skewed by high-scoring 2nd and 3rd hits
NumMatchesPerSpec=1

#Amino Acid Modification Examples
# Specific static modifications using one or more StaticMod= entries
# Specific dynamic modifications using one or more DynamicMod= entries
# Modification format is:
# Mass or CompositionStr, Residues, ModType, Position, Name (all the five fields are required).
# Examples:
#   C2H3N1O1,  C,  fix, any,         Carbamidomethyl    # Fixed Carbamidomethyl C (alkylation)
#   O1,        M,  opt, any,         Oxidation          # Oxidation M
#   15.994915, M,  opt, any,         Oxidation          # Oxidation M (mass is used instead of CompositionStr)
#   H-1N-1O1,  NQ, opt, any,         Deamidated         # Negative numbers are allowed.
#   CH2,       K,  opt, any,         Methyl             # Methylation K
#   C2H2O1,    K,  opt, any,         Acetyl             # Acetylation K
#   HO3P,      STY,opt, any,         Phospho            # Phosphorylation STY
#   C2H3NO,    *,  opt, N-term,      Carbamidomethyl    # Variable Carbamidomethyl N-term
#   H-2O-1,    E,  opt, N-term,      Glu->pyro-Glu      # Pyro-glu from E
#   H-3N-1,    Q,  opt, N-term,      Gln->pyro-Glu      # Pyro-glu from Q
#   C2H2O,     *,  opt, Prot-N-term, Acetyl             # Acetylation Protein N-term

#Custom amino acids examples
# Only supports empirical formulas of elements C H N O S.
# If other elements are needed, or a specific mass is needed, they can be added as fixed modifications on the custom AA
# Maximum atom counts: 255 C, 255 H, 63 N, 63 O, 15 S
# Format spec is:
# EmpiricalFormula, ResidueSymbol, custom, OriginalAA, Name (all the five fields are required, though OriginalAA is not actually used for anything)
# Examples:
#   C5H7N1O2S0,J,custom,P,Hydroxylation     # Hydroxyproline
#   C3H6N2O0S1,X,custom,C,Amidation         # C-terminal amidation of Cys
#   C5H5N1O1S0,Z,custom,E,Glu->pyro-Glu     # N-terminal pyroGlu residue, from either Glu OR Gln
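
The parameter file above is a plain Key=Value format with `#` comments, and keys such as StaticMod and DynamicMod may repeat. A minimal Python sketch for reading it into a dictionary is shown below. The function name is hypothetical, and the parsing rules are inferred from the listing above rather than taken from the MS-GF+ source.

```python
from collections import defaultdict

def parse_msgfplus_params(text: str) -> dict:
    """Parse Key=Value lines, strip '#' comments, and collect repeated keys into lists."""
    params = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop full-line and trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()].append(value.strip())
    return dict(params)

# A few lines copied from the listing above.
sample = """\
PrecursorMassTolerance=20ppm
StaticMod=C2H3N1O1,  C,   fix,  any,  Carbamidomethyl   # Fixed Carbamidomethyl C
DynamicMod=O1,       M,   opt,  any,  Oxidation         # Oxidized methionine
DynamicMod=H-1N-1O1, NQ,  opt,  any,  Deamidated
#CustomAA=H0,        X,  custom, X,   RemoveAA
NTT=2
"""
params = parse_msgfplus_params(sample)
print(params["DynamicMod"])
print(params["NTT"])
```

Every value is kept as a list of strings so that single-valued and repeated keys are handled the same way. Type conversion is left to the caller.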

3. Results

The native output format is mzIdentML; example file: test.mzid

There are two ways to convert an mzid file to TSV so that the results are easier to read. For details, see https://msgfplus.github.io/msgfplus/MzidToTsv.html

  • The first is the MzIDToTsv tool built into MSGFPlus.jar. It is easy to use but slow on large files.
Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv
    -i MzIDFile (MS-GF+ output file (*.mzid))
    [-o TSVFile] (TSV output file (*.tsv) (Default: MzIDFileName.tsv))
    [-showQValue 0/1] (0: do not show Q-values, 1: show Q-values (Default))
    [-showDecoy 0/1] (0: do not show decoy PSMs (Default), 1: show decoy PSMs)
    [-unroll 0/1] (0: merge shared peptides (Default), 1: unroll shared peptides)
  • The second is the standalone MzidToTsvConverter.exe tool. It converts quickly and handles large files, but it is Windows-only (on Linux it requires mono).
MzidToTsvConverter.exe -mzid:SearchResults.mzid -unroll -showDecoy

Example file after conversion to TSV: test_Unrolled.tsv

The header contains the following columns:

      1 #SpecFile
      2 SpecID
      3 ScanNum
      4 FragMethod
      5 Precursor
      6 IsotopeError
      7 PrecursorError(ppm)
      8 Charge
      9 Peptide
     10 Protein
     11 DeNovoScore
     12 MSGFScore
     13 SpecEValue
     14 EValue
     15 QValue
     16 PepQValue
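
With the TSV in hand, a common next step is to keep only confident PSMs by thresholding the QValue column. The Python sketch below does that with the standard csv module. The function name, the 0.01 cutoff, and the toy rows (which use only a subset of the columns listed above) are all invented for illustration and are not real search results.

```python
import csv
import io

def confident_psms(tsv_text: str, max_qvalue: float = 0.01):
    """Yield rows (dicts keyed by header name) whose QValue is at or below the cutoff."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if float(row["QValue"]) <= max_qvalue:
            yield row

# Toy data, invented for illustration: a subset of the real columns.
header = ["#SpecFile", "ScanNum", "Peptide", "Protein", "QValue"]
toy_rows = [
    ["test.mzML", "101", "PEPTIDEK", "TOY_PROTEIN_1", "0.0"],
    ["test.mzML", "102", "SAMPLER", "TOY_PROTEIN_2", "0.25"],
]
toy_tsv = "\n".join("\t".join(fields) for fields in [header] + toy_rows) + "\n"

for psm in confident_psms(toy_tsv):
    print(psm["ScanNum"], psm["Peptide"], psm["QValue"])
```

For a real file, the same function could be fed the contents of test_Unrolled.tsv. Note that the QValue column is only present when the search was run with a decoy database and Q-values were not disabled during conversion.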

References:
https://msgfplus.github.io/msgfplus/index.html
http://www.psidev.info/mzidentml
https://omics.pnl.gov/software/ms-gf
https://github.com/MSGFPlus/msgfplus
https://github.com/MSGFPlus/msgfplus/tree/master/docs/ParameterFiles
https://msgfplus.github.io/msgfplus/MzidToTsv.html
https://github.com/MSGFPlus/msgfplus/releases


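Summary of proteomics identification and quantification software:
[1] Proteomics Identification Software: X!Tandem
[2] Proteomics Identification Software: Comet
[3] Proteomics Identification Software: Mascot
[4] Proteomics Identification Software: MSGFPlus
[5] Proteomics Identification and Quantification Software: PD
[6] Proteomics Identification and Quantification Software: MaxQuant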
