pybedtools文档-平行在外显子和内含子中计算reads数

本例子展示的是如何在内含子和外显子中统计reads数量。不但包括了这个内容,还包括了其他的命名展示。例如

1、BAM file support (for more, see Working with BAM files)

2、indexing into Interval objects (for more, see Intervals)

3、filtering (for more, see Filtering)

4、streaming (for more, see Using BedTool objects as iterators/generators)

5、ability to use parallel processing

下面是含有注释信息的命令行展示

import sys

import multiprocessing

import pybedtools


# get example GFF and BAM filenames 获得例子中的GFF和BAM的内容

gff = pybedtools.example_filename('gdc.gff')

bam = pybedtools.example_filename('gdc.bam')


# Some GFF files have invalid entries -- like chromosomes with negative coords GFF文件中含有不实的内容,例如染色为负数坐标的信息

# orfeatures of length = 0.  This lineremoves them and saves the result in a tempfile 或者特征值的长度为0,可以使用命令行去除这些无意义的内容,并且存储为另外一个文件。

g = pybedtools.BedTool(gff).remove_invalid().saveas()

# Next, we create a function to pass only features for a particular featuretype.  This is similar to a"grep" operation when applied to every#feature in a BedTool

随后,我们构建了一个函数用来只显示符合要求的特征,类似于grep命令

def featuretype_filter(feature, featuretype):

    if feature[2] == featuretype:

        return True

    return False

# This function will eventually be run in parallel, applying the filter above to several different BedTools simultaneously 多种不同的BedTools同时选择

def subset_featuretypes(featuretype):

    result = g.filter(featuretype_filter, featuretype).saveas()

    return pybedtools.BedTool(result.fn)


# Thisfunction performs the intersection of a BAM file with a GFF file and returns the total number of hits.  Itwill eventually be run in parallel.

def count_reads_in_features(features_fn):

    """

    Callback function to count reads infeatures

    """

  # BAM files are auto-detected; no need foran `abam` argument.  Here we # construct a new BedTool out of the BAM file and intersect it with the features filename. We use stream=True so that no intermediate tempfile is created, and bed=True so that the.count() method can iterate through the resulting streamed BedTool.

    return pybedtools.BedTool(bam).intersect(b=features_fnstream=True).count()

# Set up a pool of workers for parallel processing pool = multiprocessing.Pool()

Create separate files for introns and exons, using the function we defined above

featuretypes = ('intron', 'exon')

introns, exons = pool.map(subset_featuretypes, featuretypes)


Perform some genome algebra to get unique and shared intron/exon regions. Here we keep only the filename of the results, which is safer than an entire BedTool for passing around in parallel computations.

exon_only = exons.subtract(introns).merge().remove_invalid().saveas().fn

intron_only = introns.subtract(exons).merge().remove_invalid().saveas().fn

intron_and_exon = exons.intersect(introns).merge().remove_invalid().saveas().fn

# Do intersections with BAM file in parallel, using the other function we defined above

features = (exon_only, intron_only, intron_and_exon)

results = pool.map(count_reads_in_features, features)


# Print the results

labels = (' exon only:', ' intron only:', 'intron and exon:')

for label, reads in zip(labels, results):

    sys.stdout.write('%s %s\n' % (label, reads))


©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容

  • 古人云:临摹用工,是学书大要。必先求古人意指,次究用笔,后像形体。唐太宗云吾临古人之书,殊不学其形似,务在求其气骨...
    愚公一海阅读 649评论 0 0
  • 8 在这个城市,与自己轨迹最重合的那个人,通常便是前任。她知道你喜欢去吃的特色美食街,大到会去的区域,小到哪家餐厅...
    MissMatcha阅读 541评论 0 0
  • 虽然大学生创业有很多劣势,前面的文章分析过,也给出了解决方式。但是在校大学生创业也有一些他人不具备的优势。 ...
    润东先生阅读 1,415评论 8 18