[Repost] EST Sequence Assembly Pipeline

2820 views | 2019-1-28 09:41 | Personal category: Bioinformatics | System category: Research notes | Source: repost

http://blog.sina.com.cn/s/blog_4476400f0100iq0x.html

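EST ---->

CD-HIT (search the ESTs for redundancy; clustering removes redundant sequences quickly and in bulk) ---->

est_trimmer (trim the 5' and 3' ends, the "cap" and the "tail", and drop sequences too short to be reliable) ---->

RepeatMasker (remove transposons and other repeats) ---->

seqclean (remove vector, mitochondrial, chloroplast and similar sequences) ---->

CAP3 (assembly)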
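The order of the steps above can be sketched as a small driver script. This is only an illustration under stated assumptions: the intermediate file names (nr.fasta, trimmed, clean.fasta), the CD-HIT identity threshold of 0.95, and the helper names run and est_pipeline are made up for this sketch and are not from the original post. cd-hit-est is the nucleotide program of the CD-HIT package; the est_trimmer, RepeatMasker and seqclean arguments follow the author's own commands. RepeatMasker is expected to write its masked output next to the input with a .masked suffix, which should be checked on your own installation.

```shell
# Sketch of the pipeline order; file names and thresholds are placeholders.
# Set DRY_RUN=1 to print the commands without running them.

run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

est_pipeline() {
    ests=$1    # raw EST sequences in FASTA format

    # 1. Cluster the ESTs and keep one representative per cluster.
    run cd-hit-est -i "$ests" -o nr.fasta -c 0.95 -n 10

    # 2. Trim poly-T / poly-A ends and apply the length cutoff;
    #    results go to trimmed.results.
    run perl est_trimmer.pl nr.fasta -tr5=T,5,50 -tr3=A,5,50 -cut=100,10000 -id=trimmed

    # 3. Mask repeats (assumed output: trimmed.results.masked).
    run RepeatMasker -e crossmatch -s trimmed.results

    # 4. Remove vector contamination using the formatted UniVec database.
    run seqclean trimmed.results.masked -v UniVec -o clean.fasta

    # 5. Assemble; CAP3 writes clean.fasta.cap.contigs and .cap.singlets.
    run cap3 clean.fasta
}
```

Running `DRY_RUN=1 est_pipeline ests.fasta` prints the five commands in order, which is a cheap way to review the plan before spending hours on the RepeatMasker step.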

est_trimmer can be downloaded from http://pgrc.ipk-gatersleben.de/misa/download/est_trimmer.pl. It is just a Perl script, so no installation is needed. The script's usage:

## DESCRIPTION: Tool for trimming EST (DNA) sequences
##
## SYNTAX: est_trimmer.pl <FASTAfile> [-amb=n,win] [-tr5=(A|C|G|T),n,win]
##         [-tr3=(A|C|G|T),n,win] [-cut=min,max] [-id=name]
##         [-help]
##
## <FASTAfile>     Single file in FASTA format containing the sequence(s).
## [-amb=n,win]    Removes distal stretches containing "n" ambiguous bases in
##                 "win" bp sized window.
## [-tr5=N,n,win]  Removes stretches of the given type N={A,C,G,T} from the 5'
##                 end. Value "n" defines the min. accepted repeat number of "N"
##                 in a 5' window of the size "win".
## [-tr3=N,n,win]  according to [-tr5] for the 3' end.
## [-cut=min,max]  Sets min. value for cutoff and max. sequence size.
## [-id=name]      Optional. Final results are stored in "name".results, whereas
##                 processing steps are listed in "name".log. If not used,
##                 extensions are appended to <FASTAfile>.
## [-help]         Further descriptions. Use "EST_trimmer.pl -help".
##
## Arguments can be used plurally and are processed according to their order
##
## EXAMPLE: est_trimmer.pl ESTs -amb=2,50 -tr5=T,5,50 -tr3=A,5,50 -cut=100,700
## ____________________________________________________________________________

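Personally I find -amb too aggressive, so I left it out. With the maximum set to 700 as in the example, -cut also removed too many sequences, so I raised the maximum to 10000.

My command: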

perl est_trimmer.pl input -tr5=T,5,50 -tr3=A,5,50 -cut=100,10000 -id=output
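One way to choose the -cut limits is to look at the length range of the input first, and to run the same check on the trimmed output to see how much was removed. The following is a minimal sketch; fasta_stats is a hypothetical helper written for this note and is not part of est_trimmer.pl.

```shell
# Hypothetical helper: print the number of sequences and the shortest and
# longest sequence length found in a FASTA file.
fasta_stats() {
    awk '
        # Fold the record that just ended into the running statistics.
        function record() {
            n++
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { if (seen) record(); seen = 1; len = 0; next }
        # Sequence lines: ignore whitespace and add up the residues.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END {
            if (seen) record()
            printf "n=%d min=%d max=%d\n", n, min, max
        }
    ' "$1"
}
```

For example, `fasta_stats input` before trimming and `fasta_stats output.results` afterwards shows whether the chosen maximum is still discarding long sequences.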

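RepeatMasker download page: http://repeatmasker.org/RMDownload.html

RepeatMasker is a fairly complex program with many parameters. It also requires cross_match or WU-BLAST to be installed on the local machine, so read the manual and configure it for your own situation. The program relies on a repeat database that is updated every year, which you need to keep in mind when running it locally.

RepeatMasker is also really slow, so it is best to let it use several CPUs at once (its -pa option sets the number of parallel jobs).

My command:

repeatmasker input -e crossmatch -s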

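seqclean (download: http://compbio.dfci.harvard.edu/tgi/software/)

I had no trouble with its parameters. You only need to download the vector sequences from NCBI at ftp://ftp.ncbi.nih.gov/pub/UniVec/. The directory holds a core set and a full set; my data runs quickly anyway, so I chose the larger file. Format UniVec with the formatdb command and it is ready to use.

My commands: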

/usr/biosoft/blast-2.2.18/bin/formatdb -i UniVec -p F -o T

/usr/biosoft/seqclean/seqclean BnE091007.fasta -v UniVec -o BnE_clean.fasta
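Before launching seqclean it can help to confirm that formatdb really produced the database index. A minimal sketch, assuming a small single-volume nucleotide database, for which formatdb -p F writes .nhr, .nin and .nsq files next to the FASTA file; blast_db_ready is a hypothetical helper name, not part of BLAST or seqclean.

```shell
# Hypothetical helper: succeed only if the nucleotide index files that
# formatdb -p F writes next to the FASTA file are present and non-empty.
blast_db_ready() {
    db=$1
    for ext in nhr nin nsq; do
        if [ ! -s "$db.$ext" ]; then
            echo "missing index file: $db.$ext" >&2
            return 1
        fi
    done
}
```

A typical use would be `blast_db_ready UniVec && /usr/biosoft/seqclean/seqclean BnE091007.fasta -v UniVec -o BnE_clean.fasta`.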

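At the time I could not get it to run at all because the program lacked the necessary permissions. It only worked after I used chmod to change the permissions on everything in the seqclean program folder. Fortunately it finally succeeded.

To repost this article, please contact the original author for permission and state that it comes from Zhao Lei's ScienceNet blog.
Link: https://blog.sciencenet.cn/blog-299308-1159474.html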
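The permission fix can be sketched as follows. The post does not record the exact chmod invocation, so the mode u+x and the helper name fix_exec_perms are assumptions made for illustration.

```shell
# Hypothetical helper: list the regular files under a tool directory that
# the owner cannot execute, then add the owner execute bit recursively.
fix_exec_perms() {
    dir=$1
    find "$dir" -type f ! -perm -100 -print
    chmod -R u+x "$dir"
}
```

With the install path used in the commands above this would be `fix_exec_perms /usr/biosoft/seqclean`. Printing the affected files first makes it easy to see which scripts were the problem.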
