A few small tricks for TBtools enrichment analysis
sed "s/[(][^)]*[)]//g" cxx.go.annotation.xls | sed 's/ activity)//g' | sed 's/[a-zA-Z]\+)//g' | sed 's/)//g' | grep "evm" | sed 's/-//g' | tr "\t" "," | sed 's/,\+/,/g' | sed 's/,$//g' | sed 's/,/\t/' | sed 's/,/,t/g' > dt.bg
fasta2phy:
cat 4.4Dsites.pl.connect4Dsites.fa | tr '\n' '\t' | \
  sed 's/>/\n/g' | \
  sed 's/\t/ /' | sed 's/\t//g' | \
  awk 'NF > 0' > 4.4Dsites.pl.connect4Dsites.fa.tmp
awk '{print " "NR" "length($2)}' 4.4Dsites.pl.connect4Dsites.fa.tmp | \
  tail -n 1 | \
  cat - 4.4Dsites.pl.connect4Dsites.fa.tmp > 4.4Dsites.pl.connect4Dsites.phy
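A one-pass awk sketch of the same fasta2phy conversion, in case the temporary file is unwanted. The function name fasta2phy is made up here, and it assumes the sequences are already aligned (equal length) and that only the first word of each header is wanted as the taxon name.

```shell
# fasta2phy FILE: hypothetical helper; writes sequential PHYLIP to stdout.
# Assumes aligned sequences of equal length and headers without spaces.
fasta2phy() {
  awk '
    /^>/ { if (n > 0) seqs[n] = seq              # store the previous record
           n++; names[n] = substr($1, 2); seq = ""; next }
    { gsub(/[ \t\r]/, ""); seq = seq $0 }        # join wrapped sequence lines
    END {
      if (n == 0) exit 1
      seqs[n] = seq
      print " " n " " length(seqs[1])            # header: taxon count, site count
      for (i = 1; i <= n; i++) print names[i] " " seqs[i]
    }' "$1"
}
```

Usage: fasta2phy 4.4Dsites.pl.connect4Dsites.fa > 4.4Dsites.pl.connect4Dsites.phy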
FastEPRR reports a genotype error!
zcat input.vcf.gz | perl -pe 's#\s\.:#\t./.:#g' | bgzip -c > output.vcf.gz
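A field-aware alternative, as a sketch only. It assumes the error comes from missing genotypes written as a bare "." instead of "./.", and it only touches the sample columns (column 10 onward), so a "." in the ID or INFO column is left alone. The function name fix_missing_gt is hypothetical.

```shell
# fix_missing_gt: hypothetical helper; reads an uncompressed VCF on stdin.
fix_missing_gt() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }                      # header lines pass through
       { for (i = 10; i <= NF; i++) {            # sample columns start at 10
           if ($i == ".") $i = "./."
           else if (substr($i, 1, 2) == ".:") $i = "./." substr($i, 2)
         }
         print }'
}
```

Usage: zcat input.vcf.gz | fix_missing_gt | bgzip -c > output.vcf.gz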
Structure
For population structure analyses, we discarded SNPs with missing rate > 20% and minor allele frequency (MAF) < 5%. We also excluded highly correlated SNPs by performing an LD-based SNP pruning process in PLINK v1.90.
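A sketch of how these filters could be run with PLINK 1.9. The quoted text does not give the pruning window, so the 50-SNP window, 10-SNP step and r² threshold of 0.2 below are assumptions, as are the function name prune_snps and the file prefixes.

```shell
# prune_snps IN_PREFIX OUT_PREFIX: hypothetical helper around PLINK 1.9.
# The 50 10 0.2 pruning parameters are assumed, not taken from the source.
prune_snps() {
  in_prefix=$1
  out_prefix=$2
  # 1) drop SNPs with missing rate > 20% or MAF < 5%
  plink --bfile "$in_prefix" --geno 0.2 --maf 0.05 \
        --make-bed --out "$out_prefix.filtered" || return 1
  # 2) list SNPs that are in approximate linkage equilibrium
  plink --bfile "$out_prefix.filtered" --indep-pairwise 50 10 0.2 \
        --out "$out_prefix.ld" || return 1
  # 3) keep only the SNPs that survived pruning
  plink --bfile "$out_prefix.filtered" --extract "$out_prefix.ld.prune.in" \
        --make-bed --out "$out_prefix.pruned"
}
```

Usage: prune_snps mydata mydata.qc (non-human chromosome names may additionally need --allow-extra-chr).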
Mantel test
vegan (R package)
QC software
Trimmomatic (the adapter sequences must be known)
FSC
We excluded SNPs from three genomic regions under long-term balancing selection (see the “Balancing selection in B. stricta genomes” section) and only used fourfold degenerate sites and intergenic regions, because they are less affected by selection.
To account for the influence of effective population size on estimated ρ, we divided ρ by diversity (π) in each 20-kb window following Wang et al. [4] and compared ρ/π between islands and the rest of the genome.
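The per-window ratio can be sketched with awk. The input layout is an assumption: a headerless tab-separated table with one 20-kb window per line and columns chrom, start, end, rho, pi. The function name rho_over_pi is made up.

```shell
# rho_over_pi FILE: hypothetical helper.
# Assumed columns (tab-separated, no header): chrom start end rho pi
rho_over_pi() {
  awk 'BEGIN { FS = OFS = "\t" }
       $5 + 0 > 0 { print $1, $2, $3, $4 / $5 }  # skip windows where pi is 0
      ' "$1"
}
```

Windows with π = 0 are dropped instead of printed, because the ratio is undefined there.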
Extract odd-numbered lines: sed -n '1~2p' a.txt
Extract even-numbered lines: sed -n '0~2p' a.txt
Randomly sample N lines from a file on Linux:
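The first~step address form is a GNU sed extension, so it may fail on BSD/macOS sed. A portable awk sketch of the same two operations (the function names are made up):

```shell
# Hypothetical helpers; NR is the current line number in awk.
odd_lines()  { awk 'NR % 2 == 1' "$@"; }   # lines 1, 3, 5, ...
even_lines() { awk 'NR % 2 == 0' "$@"; }   # lines 2, 4, 6, ...
```

Usage: odd_lines a.txt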
shuf -n100 filename
sort -R filename | head -n100
Renumber the SNP IDs (column 2) of a .map file as SNP1, SNP2, ...:
awk '{x+=1}{print $1"\tSNP"x"\t"$3"\t"$4}' test.map > map
discoVista
singularity exec discovista_latest.sif discoVista.py -a name -p 03.phlotree/tree -o figure -m 5 -g Cattle-
Where the tree folder contains: estimated_gene_trees.tree (gene trees)
estimated_species_tree.tree (coalescent-based species tree)
name file:
cattle Cattle-
horse Horse-
gorilla Gorilla-
human Human-
Convert a FASTA file to single-line sequences
perl -pe '/^>/ ? print "\n" : chomp' protein.fa | tail -n +2 > protein_new.fa
Sorting with sort (column 1 as text, then column 2 numerically)
sort -k 1,1 -k 2,2n
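Extract given regions from a BAM file
bedtools intersect -a input.bam -b region.bed > region.bam
Complement of a BED file
bedtools complement -i region.sort.bed -g size.sort.txt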
QC for third-generation (long-read) sequencing
NanoStat
NanoPlot
Convert a Singularity .sif image to a sandbox
singularity build --sandbox ./tmp/ your.sif
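Coordinate conventions of various file formats:
Ten things about TEs
1. There are many types of TEs.
2. TEs are not randomly distributed in the genome: they have insertion preferences, but they can be removed by selection or genetic drift. TEs with little or no effect can also be removed.
3. TEs are an extensive source of mutations and genetic polymorphisms. TEs are more active under stress.
4. TEs can cause genome rearrangements and SVs.
5. TE expression and repression are in balance. TEs need to control their own copy number, and host factors can also control TE expression; reduced DNA methylation can promote TE expression. TEs also differ among tissues and among developmental stages of an individual.
6. TEs are insertional mutagens in both the germline and somatic cells.
7. TEs can damage the genome even without transposition.
8. TEs can be a source of coding genes or non-coding RNAs (TE→gene, e.g. Rag1, Rag2).
9. TEs provide cis-acting elements and modify regulatory networks.
10. Different tools are needed.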
Several tools for reconstructing ancestral chromosomes
[ANGeS/anges_1_01_v2.pdf at master · cchauve/ANGeS (github.com)](https://github.com/cchauve/ANGeS/blob/master/anges_1_01_v2.pdf)
[jkimlab/DESCHRAMBLER (github.com)](https://github.com/jkimlab/DESCHRAMBLER)
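It is generally believed that ancestral species had DNA methylation and that some species lost it during evolution
Gene-body methylation may promote gene expression
Find the CDS, intron, and non-genic regions of a genome from its sequence file and annotation file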
samtools faidx input.fasta
cut -f 1,2 input.fasta.fai | sort -k 1,1 > size.sort.txt
# GFF3 is 1-based and inclusive, BED is 0-based and half-open, so start becomes $4-1
awk -F'\t' -v OFS='\t' '$3=="gene"{print $1,$4-1,$5}' input.gff3 | sort -k 1,1 -k 2,2n > gene.region.sort.bed
awk -F'\t' -v OFS='\t' '$3=="CDS"{print $1,$4-1,$5}' input.gff3 | sort -k 1,1 -k 2,2n > cds.region.sort.bed
bedtools complement -i gene.region.sort.bed -g size.sort.txt > non-gene.region.bed
# Note: gene minus CDS also contains the UTRs, not only introns
bedtools subtract -a gene.region.sort.bed -b cds.region.sort.bed > intron.region.sort.bed
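A reusable sketch for pulling one feature type out of a GFF3 file as sorted BED. GFF3 coordinates are 1-based and inclusive while BED is 0-based and half-open, so the start is decremented by one and the end is kept. The function name gff_to_bed is made up.

```shell
# gff_to_bed FEATURE FILE: hypothetical helper; prints sorted 3-column BED.
gff_to_bed() {
  awk -v feat="$1" 'BEGIN { FS = OFS = "\t" }
       !/^#/ && $3 == feat { print $1, $4 - 1, $5 }   # 1-based -> 0-based start
      ' "$2" |
    sort -k1,1 -k2,2n
}
```

Usage: gff_to_bed gene input.gff3 > gene.region.sort.bed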