Sequence-Searching Related Tasks
Modules included in this section
makeblastdb
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Create a blastdb from a fasta file
Requires
fasta files in the following slots:
sample_data[<sample>]["fasta.nucl"|"fasta.prot"]
- Or (if scope is set to project):
sample_data["fasta.nucl"|"fasta.prot"]
Output
A BLAST database in the following slots:
sample_data[<sample>]["blastdb"]
sample_data[<sample>]["blastdb.nucl"|"blastdb.prot"]
sample_data[<sample>]["blastdb.nucl.log"|"blastdb.prot.log"]
Or (if scope is set to project):
sample_data["blastdb"]
sample_data["blastdb.nucl"|"blastdb.prot"]
sample_data["blastdb.nucl.log"|"blastdb.prot.log"]
Parameters that can be set:
Parameter | Values | Comments |
---|---|---|
scope | sample/project | Whether the project-wide or the per-sample fasta slot should be used. |
-dbtype | nucl/prot | This is a compulsory redirected parameter. Helps the module decide which fasta file to use. |
Lines for parameter file
mkblst1:
    module: makeblastdb
    base: trinity1
    script_path: /path/to/bin/makeblastdb
    redirects:
        -dbtype: nucl
    scope: project
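As a rough illustration of what this step amounts to, the sketch below picks the input fasta slot from -dbtype and scope and assembles a makeblastdb command line. The sample_data layout follows the slots listed above. The helper name build_makeblastdb_cmd, the example paths and the output naming are assumptions made for this example and are not taken from the module's code.

```python
def build_makeblastdb_cmd(sample_data, dbtype, scope="sample", sample=None,
                          script_path="makeblastdb"):
    """Return a makeblastdb argument list for one sample or for the project.

    Hypothetical helper: names and output paths are illustrative only.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("-dbtype must be 'nucl' or 'prot'")
    slot = "fasta." + dbtype
    # Project scope reads the top-level slot; sample scope reads the sample's own slot.
    source = sample_data if scope == "project" else sample_data[sample]
    if slot not in source:
        raise KeyError(f"no {slot} file available for scope '{scope}'")
    fasta = source[slot]
    return [script_path, "-in", fasta, "-dbtype", dbtype,
            "-out", fasta + ".blastdb", "-logfile", fasta + ".blastdb.log"]


# Made-up paths, for illustration only.
sample_data = {"fasta.nucl": "/data/project/Trinity.fasta",
               "S1": {"fasta.prot": "/data/S1/proteins.faa"}}
cmd = build_makeblastdb_cmd(sample_data, "nucl", scope="project")
print(" ".join(cmd))
```

The command is returned as a list rather than executed, so it can be inspected or written into a script before anything is run.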
References
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), pp.3389-3402.
blast
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for executing BLAST of any type on a nucleotide or protein fasta file. The search can be either on a sample fasta or on a project-wide fasta. It can use the fasta as a database or as a query. If used as a database, you must call the makeblastdb module prior to this step.
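Both query and db parameters must be passed. They should be set to one of the following values:
Value | Description |
---|---|
sample | The sample-scope fasta file (for query) or BLAST database (for db) is used. |
project | The project-scope fasta file (for query) or BLAST database (for db) is used. |
A path | A path to a fasta file or BLAST index. |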
The type of fasta and database to use are set with the querytype and dbtype parameters, respectively.
dbtype must be set if db is set to sample or project.
querytype must be set regardless. It determines the type of BLAST report (i.e. whether it will be stored in blast.nucl or blast.prot).
Requires:
fasta files in one of the following slots for sample-wise blast:
sample_data[<sample>]["fasta.nucl"]
sample_data[<sample>]["fasta.prot"]
or fasta files in one of the following slots for project-wise blast:
sample_data["fasta.nucl"]
sample_data["fasta.prot"]
or a makeblastdb index in one of the following slots:
When db is set to ‘project’:
sample_data["blastdb.nucl"|"blastdb.prot"]
When db is set to ‘sample’:
sample_data[<sample>]["blastdb.nucl"|"blastdb.prot"]
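File type | Scope | Comments |
---|---|---|
fasta.nucl | sample/project | If query is sample or project and querytype is nucl |
fasta.prot | sample/project | If query is sample or project and querytype is prot |
blastdb.nucl | sample/project | If db is sample or project and dbtype is nucl |
blastdb.prot | sample/project | If db is sample or project and dbtype is prot |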
Output:
puts BLAST output files in the following slots for sample-wise blast:
sample_data[<sample>]["blast.nucl"|"blast.prot"]
sample_data[<sample>]["blast"]
puts BLAST output files in the following slots for project-wise blast:
sample_data["blast.nucl"|"blast.prot"]
sample_data["blast"]
File type | Scope | Comments |
---|---|---|
blast.nucl | sample/project | Blast report if querytype is nucl |
blast.prot | sample/project | Blast report if querytype is prot |
blast | sample/project | Blast report, regardless of querytype |
Parameters that can be set
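Parameter | Values | Comments |
---|---|---|
dbtype | nucl/prot | Helps the module decide which blastdb to use. |
querytype | nucl/prot | Helps the module decide which fasta file to use. |
query | sample/project/<Path to fasta or BLAST index> | Set to sample or project to use a fasta file from that scope, or to the path of an external file. |
db | sample/project/<Path to BLAST index> | Set to sample or project to use a BLAST database from that scope, or to the path of an external BLAST index. |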
Note
You can’t set both db and query to external files. At least one of them has to be sample or project.
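The rules for query, db, dbtype and querytype described above can be sketched as a small validation function. This is only an illustration of the documented constraints: the function name check_blast_params and the error messages are hypothetical and do not come from the module itself.

```python
def check_blast_params(params):
    """Validate query/db settings and return the slot name of the BLAST report.

    Hypothetical helper mirroring the rules in the text; not the module's code.
    """
    scopes = ("sample", "project")
    query, db = params.get("query"), params.get("db")
    if query is None or db is None:
        raise ValueError("both 'query' and 'db' must be passed")
    # At least one of query/db has to come from the sample or project scope.
    if query not in scopes and db not in scopes:
        raise ValueError("'query' and 'db' cannot both be external files")
    if db in scopes and params.get("dbtype") not in ("nucl", "prot"):
        raise ValueError("'dbtype' must be nucl or prot when 'db' is sample or project")
    if params.get("querytype") not in ("nucl", "prot"):
        raise ValueError("'querytype' must be nucl or prot")
    return "blast." + params["querytype"]


slot = check_blast_params({"query": "sample", "querytype": "prot",
                           "db": "/path/to/index"})
print(slot)
```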
Lines for parameter file
External query, project-wise nucl-type database (must be preceded by the makeblastdb module):
tbl_blst_int:
    module: blast
    base: mkblst1
    script_path: {Vars.Programs.blast.Bin}/blastn
    query: /path/to/query.fasta
    querytype: nucl
    db: project
    dbtype: nucl
    redirects:
        -evalue: 0.0001
        -max_target_seqs: 5
        -num_of_proc: 20
        -num_threads: 20
Sample specific prot-type fasta, external database:
tbl_blst_ext:
    module: blast
    base: prokka1
    script_path: {Vars.Programs.blast.Bin}/blastp
    query: sample
    querytype: prot
    db: {Vars.Genome.blast_index}
    redirects:
        -evalue: 0.0001
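The redirects block in these examples holds options intended for the BLAST program itself. Assuming such options are handed to the program unchanged, a minimal sketch of flattening a redirects block into command-line arguments might look as follows. The helper name redirects_to_args is hypothetical, and the treatment of an empty value as a bare flag is an assumption of this example.

```python
def redirects_to_args(redirects):
    """Flatten a redirects block into command-line arguments.

    A key with no value (None) is treated as a bare flag (assumed behaviour).
    """
    args = []
    for flag, value in redirects.items():
        args.append(flag)
        if value is not None:
            args.append(str(value))
    return args


# Mirrors part of a step definition; only the fields needed here are shown.
step = {"script_path": "blastp",
        "redirects": {"-evalue": 0.0001, "-num_threads": 20}}
cmd = [step["script_path"]] + redirects_to_args(step["redirects"])
print(" ".join(cmd))
```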
References
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), pp.3389-3402.
parse_blast
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for running parse_blast.R.
The parse_blast.R script is available on github.
The program performs the following tasks:
It adds annotation to raw tabular BLAST output files,
filters the BLAST results by several possible fields,
selects the best hit for a group when passed a grouping field and
extracts the sequences equivalent to the alignments.
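To make the filtering and best-hit steps concrete, here is a small Python sketch of the idea. It is not the logic of parse_blast.R: the column list, the thresholds and the choice of the highest bitscore as the "best" hit are assumptions made for illustration, and the three report rows are made up.

```python
import csv
import io

# Assumed column order of the tabular BLAST report used in this example.
NAMES = "qseqid sallseqid qlen slen qstart qend sstart send length evalue bitscore".split()


def parse_hits(handle, max_evalue=1e-7, min_align_len=30, group_by="qseqid"):
    """Filter tabular BLAST rows and keep the highest-bitscore hit per group."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=NAMES, delimiter="\t"):
        # Drop hits that fail the e-value or alignment-length thresholds.
        if float(row["evalue"]) > max_evalue or int(row["length"]) < min_align_len:
            continue
        key = row[group_by]
        if key not in best or float(row["bitscore"]) > float(best[key]["bitscore"]):
            best[key] = row
    return best


# Made-up report rows, for illustration only.
report = io.StringIO(
    "q1\ts1\t300\t900\t1\t300\t10\t309\t300\t1e-50\t410\n"
    "q1\ts2\t300\t800\t1\t200\t5\t204\t200\t1e-20\t150\n"
    "q2\ts3\t120\t500\t1\t25\t40\t64\t25\t1e-9\t48\n"
)
best = parse_hits(report)
print({q: hit["sallseqid"] for q, hit in best.items()})
```

Here the second q1 hit loses to the first on bitscore, and the q2 hit is dropped because its alignment is shorter than the minimum length.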
Requires
Tabular BLAST result files in the following slots:
sample_data[<sample>]["blast.nucl"|"blast.prot"] (if scope is set to sample)
sample_data["project_data"]["blast.nucl"|"blast.prot"] (if scope is set to project)
File type | Scope | Comments |
---|---|---|
blast.nucl/blast.prot | sample/project | A tabular BLAST report |
Attention
If both blast.nucl and blast.prot exist, determine which to use by setting fasta2use. See the parameter table below.
Output
Puts the parsed report in:
sample_data[<sample>]["blast.parsed"] if scope = sample
sample_data["project_data"]["blast.parsed"] if scope = project
File type | Scope | Comments |
---|---|---|
blast.parsed | sample/project | Results of parsed blast report |
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
fasta2use | nucl/prot | If both nucl and prot BLAST reports exist, you have to specify which one to use with this parameter. |
blast_merge | Block with path and redirects | |
extract_fasta | | Should the script extract a fasta of the hits? |
Note
path in the blast_merge block can be left empty. The script will then be taken from the same location as the main parse_blast.R script.
redirects in the blast_merge block can be either in string format or in the regular block format.
Lines for parameter file
parse_blast_table:
    module: parse_blast
    base: blst_table
    script_path: {Vars.paths.parse_blast}
    scope: sample
    redirects:
        --columns2keep: '"group name accession qseqid sallseqid evalue bitscore score pident coverage align_len"'
        --dbtable: {Vars.databases.gene_list.table}
        --group_dif_name: # See parse_blast.R documentation for how this is to be specified
        --max_evalue: 1e-7
        --merge_blast: qseqid
        --merge_metadata: # See parse_blast.R documentation for how this is to be specified
        --min_align_len: 30
        --min_coverage: 60
        --names: '"qseqid sallseqid qlen slen qstart qend sstart send length evalue bitscore score pident qframe"'
        --num_hits: 1
    extract_fasta:
    blast_merge:
        path: '{Vars.paths.compare_blast_parsed_reports}'
        redirects:
            --variable: evalue
            --full_txt_output:
Gassst
- Authors
Liron Levin
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Note
This module was developed as part of a study led by Dr. Jacob Moran Gilad
Short Description
A module for executing Gassst on a nucleotide fasta file. The search can be either on a sample fasta or on a project-wide fasta. It can use the fasta as a database or as a query.
Requires
- fasta files in the following slot for sample-wise Gassst:
sample_data[<sample>]["fasta.nucl"]
- or fasta files in the following slots for project-wise Gassst:
sample_data["fasta.nucl"]
Output
- puts Gassst output files in the following slots for sample-wise Gassst:
sample_data[<sample>]["blast"]
sample_data[<sample>]["blast.nucl"]
- puts Gassst output files in the following slots for project-wise Gassst:
sample_data["blast"]
sample_data["blast.nucl"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | project/sample | Set to project to use the project-wide fasta.nucl file. The default is the sample-wide fasta.nucl file. |
Lines for parameter file
Step_Name:                  # Name of this step
    module: Gassst          # Name of the module to use
    base:                   # Name of the step [or list of names] to run after [must be after a fasta generating step]
    script_path:            # Command for running the Gassst script
                            # The Gassst module will generate blast-like output with fields:
                            # "qseqid sallseqid qlen slen qstart qend sstart send length evalue sseq"
    scope:                  # Set to project to use the project-wide fasta.nucl file. The default is the sample-wide fasta.nucl file
    qsub_params:
        -pe:                # Number of CPUs to reserve for this analysis
    redirects:
        -h:                 # Max hits per query. The best hit will be chosen downstream!
        -i:                 # Only -d [database] or -i [query], not both
        -l:                 # Complexity_filter off
        -d:                 # Only -d [database] or -i [query], not both
        -n:                 # Number of CPUs running Gassst
        -p:                 # Minimum percentage of identity. Must be in the interval [0 100]
References
Rizk, Guillaume, and Dominique Lavenier. “GASSST: global alignment short sequence search tool.” Bioinformatics 26.20 (2010): 2534-2540.
hmmscan
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for searching a fasta file with hmmscan.
Requires
If scope = sample, fasta files in one of the following slots:
sample_data[<sample>]["fasta.nucl"]
sample_data[<sample>]["fasta.prot"]
If scope = project, fasta files in one of the following slots:
sample_data["fasta.nucl"]
sample_data["fasta.prot"]
Output:
puts hmmscan output files in the following slots:
for scope = sample (depending on type passed):
sample_data[<sample>]["hmmscan.nucl"]
sample_data[<sample>]["hmmscan.prot"]
for scope = project (depending on type passed):
sample_data["hmmscan.nucl"]
sample_data["hmmscan.prot"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | sample/project | Search the project-wide fasta file or each sample's fasta file. |
type | nucl/prot | Use a prot or nucl fasta file for the search. |
output_type | tblout/domtblout/pfamtblout | tblout: parseable table of per-sequence hits to file, domtblout: parseable table of per-domain hits to file, pfamtblout: table of hits and domains in Pfam format |
hmmdb | | A path to the hmmdb to search against. |
Lines for parameter file
trino_hmmscan1_highExpr:
    module: hmmscan
    base: trino_Transdecode_highExpr
    script_path: {Vars.paths.hmmscan}
    scope: sample
    type: prot
    output_type: domtblout
    hmmdb: {Vars.databases.trinotate.pfam}
    qsub_params:
        -pe: shared 10
    redirects:
        --cpu: 1
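For orientation, the parameters above map naturally onto an hmmscan command line of the form hmmscan [options] <hmmdb> <seqfile>. The sketch below shows one way the output_type value could select the matching hmmscan table option. The helper name build_hmmscan_cmd and all paths are hypothetical, and the module may assemble its command differently.

```python
# Each output_type value corresponds to the hmmscan option of the same name.
OUTPUT_FLAGS = {"tblout": "--tblout", "domtblout": "--domtblout",
                "pfamtblout": "--pfamtblout"}


def build_hmmscan_cmd(fasta, hmmdb, output_type, out_file, cpu=1,
                      script_path="hmmscan"):
    """Assemble: hmmscan [options] <hmmdb> <seqfile> (illustrative helper)."""
    if output_type not in OUTPUT_FLAGS:
        raise ValueError(f"unknown output_type: {output_type}")
    return [script_path, "--cpu", str(cpu),
            OUTPUT_FLAGS[output_type], out_file, hmmdb, fasta]


# Made-up paths, for illustration only.
cmd = build_hmmscan_cmd("/data/S1/proteins.faa", "/db/Pfam-A.hmm",
                        "domtblout", "/out/S1.hmmscan.domtblout")
print(" ".join(cmd))
```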
References
Finn, Robert D., Jody Clements, and Sean R. Eddy. “HMMER web server: interactive sequence similarity searching.” Nucleic acids research 39.suppl_2 (2011): W29-W37.
mash_sketch
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Build mash sketches from sequence files.
Works in three modes:
scope=sample
Builds a separate sketch for each sample.
scope=project and src_scope=sample
Builds a project-wide sketch from sample sequence files. This can be used with the mash_dist module to perform all-against-all comparisons.
scope=project
Builds a sketch from project sequence files.
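The three modes above can be summarised as a lookup on scope and src_scope. The sketch below assumes that src_scope defaults to the value of scope and that a sample sketch cannot be built from project files. Both are assumptions of this example, as is the function name sketch_mode.

```python
def sketch_mode(scope, src_scope=None):
    """Describe which sketch is built for a scope/src_scope combination."""
    src_scope = src_scope or scope  # assumed default: src_scope follows scope
    if scope == "sample" and src_scope == "sample":
        return "one sketch per sample"
    if scope == "project" and src_scope == "sample":
        return "one project sketch built from all sample files"
    if scope == "project" and src_scope == "project":
        return "one project sketch built from project files"
    raise ValueError(f"unsupported combination: scope={scope}, src_scope={src_scope}")


for args in [("sample",), ("project", "sample"), ("project",)]:
    print(args, "->", sketch_mode(*args))
```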
Requires:
fasta files in one of the following slots:
sample_data[<sample>]["fasta.nucl"]
or fastq files in the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
For scope = project, uses project-wide files.
Output:
puts ‘msh’ output files in the following slots for (scope=sample):
sample_data[<sample>]["msh.fasta"]
sample_data[<sample>]["msh.fastq"]
puts ‘msh’ output files in the following slots for (scope=project):
sample_data["project_data"]["msh.fasta"]
sample_data["project_data"]["msh.fastq"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | project/sample | The scope for which to build the sketch. |
src_scope | project/sample | The scope from which to take the sequence files. Default: same as scope. |
type | fasta/fastq | Use fastq or fasta files. By default, uses any that exist. |
Lines for parameter file
Create sketch for each sample based on fastq files
sketch_smp:
    module: mash_sketch
    base: trim_gal
    script_path: "{Vars.paths.mash} sketch"
    scope: sample
    type: fastq
    rm_merged:
    qsub_params:
        -pe: shared 10
    redirects:
        -m: 2
        -p: 10
Create project sketch for all samples’ fastq files
sketch_proj:
    module: mash_sketch
    base: merge1
    script_path: "{Vars.paths.mash} sketch"
    src_scope: sample
    scope: project
    type: fastq
    rm_merged:
    redirects:
        -m: 2
        -p: 10
References
Ondov, Brian D., et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17.1 (2016): 132.
mash_dist
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Requires:
fasta files in one of the following slots:
sample_data[<sample>]["fasta.nucl"]
sample_data["fasta.nucl"]
OR fastq files in one of the following slots (merge fastq files first with mash_sketch or otherwise):
sample_data[<sample>]["fastq"]
sample_data["fastq"]
OR sketch files in one of the following slots:
sample_data[<sample>]["msh.fastq"]
sample_data[<sample>]["msh.fasta"]
sample_data["msh.fastq"]
sample_data["msh.fasta"]
Output:
puts ‘msh’ output files in the following slots for (scope=sample):
sample_data[<sample>]["msh.fasta"]
sample_data[<sample>]["msh.fastq"]
puts mash dist table files in the following slots for (scope=project and scope=all_samples):
sample_data[<sample>]["mash.dist.table"]
sample_data["mash.dist.table"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
reference | A block including ‘path’ or ‘scope’, ‘type’ and optionally ‘msh’ | |
query | A block including ‘scope’ (sample, project or all_samples), ‘type’ and optionally ‘msh’ | |
Lines for parameter file
- External reference. Sample-wise fastq files.
Returns a table of mash dist of each sample against the external reference. One table per sample.
dist:
    module: mash_dist
    base: [sketch_proj,sketch_smp]
    script_path: "{Vars.paths.mash} dist"
    reference:
        path: /path/to/ref1
    query:
        scope: sample
        type: fastq
        msh:
- Project mashed fasta reference. Sample mashed fastq query.
Returns a table of mash dist of each sample against the project reference. One table per sample.
dist:
    module: mash_dist
    base: [sketch_proj,sketch_smp]
    script_path: "{Vars.paths.mash} dist"
    reference:
        scope: project
        type: fasta
        msh:
    query:
        scope: sample
        type: fastq
        msh:
- Project mashed reference. Project mashed fastq query.
Returns a table of mash dist of the project sketch against the project sketch. One table for the whole project.
If the project sketch is built from sample sketches, as is created by mash_sketch using scope=project and src_scope=sample, the result will be an all-against-all mash dist table.
dist:
    module: mash_dist
    base: [sketch_proj,sketch_smp]
    script_path: "{Vars.paths.mash} dist"
    reference:
        scope: project
        type: fastq
        msh:
    query:
        scope: project
        type: fastq
        msh:
- Project mashed fastq reference. Sample mashed fastq query.
Returns a table of mash dist of the project sketch against each sample sketch. One table per sample.
dist:
    module: mash_dist
    base: [sketch_proj,sketch_smp]
    script_path: "{Vars.paths.mash} dist"
    reference:
        scope: project
        type: fastq
        msh:
    query:
        scope: sample
        type: fastq
        msh:
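The tables produced by these steps are plain mash dist output, which is tab-delimited with five columns: reference ID, query ID, Mash distance, p-value and matching hashes. A minimal sketch for reading such a table downstream is shown below. The MashDist record and the read_dist_table helper are hypothetical names, and the two rows are made up for illustration.

```python
from typing import NamedTuple


class MashDist(NamedTuple):
    reference: str
    query: str
    distance: float
    p_value: float
    shared_hashes: str  # kept as text, e.g. "456/1000"


def read_dist_table(lines):
    """Parse tab-delimited mash dist output into MashDist records."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ref, query, dist, pval, shared = line.split("\t")
        records.append(MashDist(ref, query, float(dist), float(pval), shared))
    return records


# Made-up rows, for illustration only.
table = ["S1.fastq\tS1.fastq\t0\t0\t1000/1000\n",
         "S1.fastq\tS2.fastq\t0.0222766\t0\t456/1000\n"]
records = read_dist_table(table)
# Closest non-self pair in an all-against-all table.
closest = min((r for r in records if r.reference != r.query),
              key=lambda r: r.distance)
print(closest.query, closest.distance)
```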
References
Ondov, Brian D., et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17.1 (2016): 132.