Sequence-Searching Related Tasks

Modules included in this section

makeblastdb ^*
blast
parse_blast
Gassst
hmmscan
mash_sketch
mash_dist

`makeblastdb` ^*

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Create a blastdb from a fasta file

Requires

fasta files in the following slots:
- sample_data[<sample>]["fasta.nucl"|"fasta.prot"]
Or (if ‘projectBLAST’ is set)
- sample_data["fasta.nucl"|"fasta.prot"]

Output

A BLAST database in the following slots:
- sample_data[<sample>]["blastdb"]
- sample_data[<sample>]["blastdb.nucl"|"blastdb.prot"]
- sample_data[<sample>]["blastdb.nucl.log"|"blastdb.prot.log"]
Or (if ‘projectBLAST’ is set):
- sample_data["blastdb"]
- sample_data["blastdb.nucl"|"blastdb.prot"]
- sample_data["blastdb.nucl.log"|"blastdb.prot.log"]

Parameters that can be set:

Parameter	Values	Comments
scope	sample\|project	Set if project-wide or sample fasta slot should be used
-dbtype	nucl/prot	This is a compulsory redirected parameter. Helps the module decide which fasta file to use.
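The slot names in the Requires and Output lists above follow directly from the value of `-dbtype`. The following is a minimal sketch of that mapping, not the module's actual code; the function name `makeblastdb_slots` is hypothetical and used only for illustration.

```python
def makeblastdb_slots(dbtype):
    """Return the slot read and the slots written for a given -dbtype.

    Illustrative helper only: it restates the slot lists documented above.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("-dbtype must be 'nucl' or 'prot'")
    return {
        "input": f"fasta.{dbtype}",
        "outputs": ["blastdb", f"blastdb.{dbtype}", f"blastdb.{dbtype}.log"],
    }


if __name__ == "__main__":
    print(makeblastdb_slots("nucl"))
```

With `scope: sample` these slots sit under `sample_data[<sample>]`; with the project scope they sit directly under `sample_data`.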

Lines for parameter file

mkblst1:
    module: makeblastdb
    base: trinity1
    script_path: /path/to/bin/makeblastdb
    redirects:
        -dbtype: nucl
    scope: project

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), pp.3389-3402.

`blast`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for executing BLAST of any type on a nucleotide or protein fasta file. The search can be either on a sample fasta or on a project-wide fasta. It can use the fasta as a database or as a query. If used as a database, you must call the makeblastdb module prior to this step.

Both the `query` and `db` parameters must be passed. Each should be set to one of the following values:

Value	Description
`sample`	The `query` or `db` should be taken from the sample scope
`project`	The `query` or `db` should be taken from the project scope
A path	A path to a fasta file or `makeblastdb` database to use as-is

The type of fasta and database to use are set with the querytype and dbtype parameters, respectively.

dbtype must be set if db is set to sample or project.
querytype must be set regardless. It will determine the type of blast report (i.e. whether it will be stored in blast.nucl or blast.prot)

Requires:

fasta files in one of the following slots for sample-wise blast:
- sample_data[<sample>]["fasta.nucl"]
- sample_data[<sample>]["fasta.prot"]
or fasta files in one of the following slots for project-wise blast:
- sample_data["fasta.nucl"]
- sample_data["fasta.prot"]
or a makeblastdb index in one of the following slots:
- When -db is set to ‘project’
  sample_data["blastdb.nucl"|"blastdb.prot"]
- When -db is set to ‘sample’
  sample_data[<sample>]["blastdb.nucl"|"blastdb.prot"]

File type	Scope	Comments
`fasta.nucl`	sample/project	If `query` is `sample` or `project` and `querytype` is `nucl`
`fasta.prot`	sample/project	If `query` is `sample` or `project` and `querytype` is `prot`
`blastdb.nucl`	sample/project	If `db` is `sample` or `project` and `dbtype` is `nucl`
`blastdb.prot`	sample/project	If `db` is `sample` or `project` and `dbtype` is `prot`

Output:

puts BLAST output files in the following slots for sample-wise blast:
- sample_data[<sample>]["blast.nucl"|"blast.prot"]
- sample_data[<sample>]["blast"]
puts BLAST output files in the following slots for project-wise blast:
- sample_data["blast.nucl"|"blast.prot"]
- sample_data["blast"]

File type	Scope	Comments
`blast.nucl`	sample/project	Blast report if `querytype` is `nucl`
`blast.prot`	sample/project	Blast report if `querytype` is `prot`
`blast`	sample/project	Blast report, regardless of `querytype`

Parameters that can be set

Parameter	Values	Comments
dbtype	nucl\|prot	Helps the module decide which blastdb to use.
querytype	nucl\|prot	Helps the module decide which fasta file to use.
query	sample\|project\|<Path to fasta or BLAST index>	Set to `sample` for sample-scope query, to `project` for project-scope query, or to a path for an external query file.
db	sample\|project\|<Path to BLAST index>	Set to `sample` for sample-scope index, to `project` for project-scope index, or to a path for an external index.

Note

You can’t set both db and query to external files. One of them at least has to be sample or project.
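The rules stated in this section (both `query` and `db` are passed, `querytype` is always set, `dbtype` is set when `db` is `sample` or `project`, and at most one of the two is an external path) can be sketched as a small check. This is an assumption-laden illustration rather than the module's own validation; `check_blast_params` is a hypothetical name.

```python
from typing import List, Optional

SCOPES = {"sample", "project"}
TYPES = ("nucl", "prot")


def check_blast_params(query: str, db: str,
                       querytype: Optional[str] = None,
                       dbtype: Optional[str] = None) -> List[str]:
    """Return a list of problems with a blast step's parameters (empty if none)."""
    problems = []
    # At least one of query/db has to come from the sample or project scope.
    if query not in SCOPES and db not in SCOPES:
        problems.append("query and db cannot both be external paths")
    # querytype is always required; it selects blast.nucl or blast.prot.
    if querytype not in TYPES:
        problems.append("querytype must be 'nucl' or 'prot'")
    # dbtype is required only when the database is taken from a scope.
    if db in SCOPES and dbtype not in TYPES:
        problems.append("dbtype must be 'nucl' or 'prot' when db is sample or project")
    return problems


if __name__ == "__main__":
    print(check_blast_params("sample", "/path/to/index", querytype="prot"))
    print(check_blast_params("/path/to/a.fasta", "/path/to/index", querytype="nucl"))
```

An empty list means the combination is consistent with the rules above; anything else names the rule that was broken.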

Lines for parameter file

External query, project-wise nucl-type database (must be preceded by the makeblastdb module):

tbl_blst_int:
    module:             blast
    base:               mkblst1
    script_path:        {Vars.Programs.blast.Bin}/blastn
    query:              /path/to/query.fasta
    querytype:          nucl
    db:                 project
    dbtype:             nucl
    redirects:
        -evalue:        0.0001
        -max_target_seqs: 5
        -num_of_proc:   20
        -num_threads:   20

Sample specific prot-type fasta, external database:

tbl_blst_ext:
    module:             blast
    base:               prokka1
    script_path:        {Vars.Programs.blast.Bin}/blastp
    query:              sample
    querytype:          prot
    db:                 {Vars.Genome.blast_index}
    redirects:
        -evalue: 0.0001

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), pp.3389-3402.

`parse_blast`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running parse_blast.R:

The parse_blast.R script is available on github.

The program performs the following tasks:

It adds annotation to raw tabular BLAST output files,
filters the BLAST results by several possible fields,
selects the best hit for a group when passed a grouping field and
extracts the sequences equivalent to the alignments.
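The filtering and best-hit steps listed above can be illustrated with a short standard-library sketch. It is not a port of parse_blast.R: the column names are those passed to `--names` in the example parameter file, while the choice of bitscore as the ranking criterion, the default thresholds and the function name `best_hits` are assumptions made for the example.

```python
import csv
import io

# Column names as passed to --names in the example parameter file.
NAMES = ("qseqid sallseqid qlen slen qstart qend sstart send "
         "length evalue bitscore score pident qframe").split()


def best_hits(handle, max_evalue=1e-7, min_align_len=30, group_field="qseqid"):
    """Filter tabular BLAST rows, then keep the highest-bitscore hit per group."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=NAMES, delimiter="\t"):
        if float(row["evalue"]) > max_evalue:
            continue  # hit is not significant enough
        if int(row["length"]) < min_align_len:
            continue  # alignment is too short
        key = row[group_field]
        if key not in best or float(row["bitscore"]) > float(best[key]["bitscore"]):
            best[key] = row
    return best


# Made-up rows, for demonstration only.
SAMPLE_REPORT = "\n".join("\t".join(fields) for fields in [
    ["q1", "s1", "100", "200", "1", "90", "5", "94", "90", "1e-30", "150", "80", "98.5", "1"],
    ["q1", "s2", "100", "300", "1", "95", "10", "104", "95", "1e-40", "180", "95", "99.0", "1"],
    ["q2", "s3", "80", "120", "1", "20", "1", "20", "20", "1e-10", "40", "20", "100.0", "1"],
    ["q3", "s4", "90", "150", "1", "60", "1", "60", "60", "0.01", "30", "15", "85.0", "1"],
]) + "\n"

if __name__ == "__main__":
    for query, hit in best_hits(io.StringIO(SAMPLE_REPORT)).items():
        print(query, hit["sallseqid"], hit["evalue"])
```

In the demonstration data, `q2` is dropped for its short alignment and `q3` for its e-value, so only the stronger of the two `q1` hits remains.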

Requires

Tabular BLAST result files in the following slots:
- sample_data[<sample>]["blast.nucl"|"blast.prot"] (if scope set to sample)
- sample_data["project_data"]["blast.nucl"|"blast.prot"] (if scope set to project)

File type	Scope	Comments
`blast.nucl\|blast.prot`	sample/project	A blast report for a `nucl` or `prot` query

Attention

If both blast.nucl and blast.prot exist, determine which to use by setting fasta2use. See parameter table below.

Output

Puts the parsed report in:
- sample_data[<sample>]["blast.parsed"] if scope = sample
- sample_data["project_data"]["blast.parsed"] if scope = project

File type	Scope	Comments
`blast.parsed`	sample/project	Results of parsed blast report

Parameters that can be set

Parameter	Values	Comments
fasta2use	`nucl\|prot`	If both nucl and prot BLAST reports exist, you have to specify which one to use with this parameter.
blast_merge		Block with `path` set to path of `compare_blast_parsed_reports.R` and `redirects` set to `compare_blast_parsed_reports.R` parameters.
extract_fasta		Should the script extract a fasta of the hits?

Note

path in blast_merge block can be left empty. The script will be taken from the same location as the main parse_blast.R script. redirects in blast_merge block can be either in string format or the regular block format.

Lines for parameter file

parse_blast_table:
    module: parse_blast
    base: blst_table
    script_path: {Vars.paths.parse_blast}
    scope: sample
    redirects:
        --columns2keep: '"group name accession qseqid sallseqid evalue bitscore score pident coverage align_len"'
        --dbtable: {Vars.databases.gene_list.table}
        --group_dif_name: # See parse_blast.R documentation for how this is to be specified
        --max_evalue: 1e-7
        --merge_blast: qseqid
        --merge_metadata: # See parse_blast.R documentation for how this is to be specified
        --min_align_len: 30
        --min_coverage: 60
        --names: '"qseqid sallseqid qlen slen qstart qend sstart send length evalue bitscore score pident qframe"'
        --num_hits: 1
    extract_fasta:
    blast_merge:
        path: '{Vars.paths.compare_blast_parsed_reports}'
        redirects:
            --variable:     evalue
            --full_txt_output:

`Gassst`

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for executing Gassst on a nucleotide fasta file. The search can be either on a sample fasta or on a project-wide fasta. It can use the fasta as a database or as a query.

Requires

fasta files in the following slot for sample-wise Gassst:

sample_data[<sample>]["fasta.nucl"]

or fasta files in the following slots for project-wise Gassst:

sample_data["fasta.nucl"]

Output

puts Gassst output files in the following slots for sample-wise Gassst:

sample_data[<sample>]["blast"]

sample_data[<sample>]["blast.nucl"]

puts Gassst output files in the following slots for project-wise Gassst:

sample_data["blast"]

sample_data["blast.nucl"]

Parameters that can be set

Parameter	Values	Comments
scope	project/sample	Set to `project` to use the project-wide `fasta.nucl` file. The default is the sample-wise `fasta.nucl` file.

Comments

This module was tested on:
Gassst v1.28

The following python packages are required:
pandas

Pass only one of -d [database] or -i [query], not both.

The Gassst module will generate BLAST-like output with the following fields:
`qseqid sallseqid qlen slen qstart qend sstart send length evalue sseq`
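For downstream scripts, the field list above can be turned into a typed record. The sketch below assumes the output is tab-separated with one hit per line, which this section does not state; `GassstHit` and `parse_hit` are hypothetical names.

```python
from typing import NamedTuple


class GassstHit(NamedTuple):
    qseqid: str
    sallseqid: str
    qlen: int
    slen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    length: int
    evalue: float
    sseq: str


# One converter per field, in the same order as the fields above.
_CONVERTERS = (str, str, int, int, int, int, int, int, int, float, str)


def parse_hit(line: str) -> GassstHit:
    """Split one line into a typed record (tab delimiter is an assumption)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(_CONVERTERS):
        raise ValueError(f"expected {len(_CONVERTERS)} fields, got {len(parts)}")
    return GassstHit(*(convert(value) for convert, value in zip(_CONVERTERS, parts)))


if __name__ == "__main__":
    # Made-up line, for demonstration only.
    print(parse_hit("read1\tgeneA\t50\t1200\t1\t50\t101\t150\t50\t1e-20\tACGT\n"))
```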

Lines for parameter file

Step_Name:                         # Name of this step
    module: Gassst                 # Name of the module to use
    base:                          # Name of the step [or list of names] to run after [must be after a fasta generating step]
    script_path:                   # Command for running the Gassst script
                                   # The Gassst module will generate blast like output with fields:
                                   # "qseqid sallseqid qlen slen qstart qend sstart send length evalue sseq"
    scope:                         # Set if project-wide fasta.nucl file type should be used [project] the default is sample-wide fasta.nucl file type
    qsub_params:
        -pe:                       # Number of CPUs to reserve for this analysis
    redirects:
        -h:                        # Max hits per query, for downstream best hit will be chosen!
        -i:                        # Only -d [database] or -i [query] not both
        -l:                        # Complexity_filter off
        -d:                        # Only -d [database] or -i [query] not both
        -n:                        # Number of CPUs running Gassst
        -p:                        # Minimum percentage of identity. Must be in the interval [0 100]

References

Rizk, Guillaume, and Dominique Lavenier. “GASSST: global alignment short sequence search tool.” Bioinformatics 26.20 (2010): 2534-2540.

`hmmscan`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for searching a fasta file with hmmscan.

Requires

If scope = sample, fasta files in one of the following slots:
- sample_data[<sample>]["fasta.nucl"]
- sample_data[<sample>]["fasta.prot"]
If scope = project, fasta files in one of the following slots:
- sample_data["fasta.nucl"]
- sample_data["fasta.prot"]

Output:

puts hmmscan output files in the following slots:
- for scope = sample (depending on type passed):
  sample_data[<sample>]["hmmscan.nucl"]
  
  sample_data[<sample>]["hmmscan.prot"]
- for scope = project (depending on type passed):
  sample_data["hmmscan.nucl"]
  
  sample_data["hmmscan.prot"]

Parameters that can be set

Parameter	Values	Comments
scope	sample\|project	Search the sample-wise fasta files or the project-wide fasta file.
type	nucl\|prot	Use a prot or nucl fasta file for the search.
output_type	tblout\|domtblout\|pfamtblout	tblout: parseable table of per-sequence hits to file, domtblout: parseable table of per-domain hits to file, pfamtblout: table of hits and domains in Pfam format
hmmdb		A path to the hmmdb to search against.
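Each `output_type` value corresponds to an hmmscan option of the same name (`--tblout`, `--domtblout`, `--pfamtblout`), and hmmscan takes the HMM database followed by the sequence file as positional arguments. The sketch below shows how such a command line could be assembled; how the module itself builds the command is not documented here, and `hmmscan_command` and the file names are hypothetical.

```python
import shlex

# hmmscan option for each value of the module's output_type parameter.
OUTPUT_FLAGS = {
    "tblout": "--tblout",
    "domtblout": "--domtblout",
    "pfamtblout": "--pfamtblout",
}


def hmmscan_command(script_path, hmmdb, fasta, out_file, output_type, cpu=1):
    """Build an hmmscan command line: options first, then <hmmdb> <seqfile>."""
    if output_type not in OUTPUT_FLAGS:
        raise ValueError(f"unknown output_type: {output_type!r}")
    args = [script_path, OUTPUT_FLAGS[output_type], out_file,
            "--cpu", str(cpu), hmmdb, fasta]
    # Quote every argument so paths with spaces survive a shell.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(hmmscan_command("hmmscan", "Pfam-A.hmm", "sample1.pep",
                          "sample1.domtblout", "domtblout", cpu=1))
    # prints: hmmscan --domtblout sample1.domtblout --cpu 1 Pfam-A.hmm sample1.pep
```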

Lines for parameter file

trino_hmmscan1_highExpr:
    module:             hmmscan
    base:               trino_Transdecode_highExpr
    script_path:        {Vars.paths.hmmscan}
    scope:              sample
    type:               prot
    output_type:        domtblout 
    hmmdb:              {Vars.databases.trinotate.pfam}
    qsub_params:
        -pe:            shared 10
    redirects:
        --cpu:          1

References

Finn, Robert D., Jody Clements, and Sean R. Eddy. “HMMER web server: interactive sequence similarity searching.” Nucleic acids research 39.suppl_2 (2011): W29-W37.

`mash_sketch`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Build mash sketches from sequence files.

Works in three modes:

scope=sample
Builds a separate sketch for each sample
scope=project and src_scope=sample
Builds a project-wide sketch from sample sequence files. This can be used with the mash_dist module to perform all-against-all comparisons.
scope=project
Builds a sketch from project sequence files.
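The three modes differ only in which sequence files feed which sketch. The sketch below restates them as a lookup, using the documented default of `src_scope` being the same as `scope`. It is an illustration, not the module's code: `plan_sketches` is a hypothetical name, and rejecting `scope=sample` with `src_scope=project` is an assumption based on that combination not being listed.

```python
def plan_sketches(samples, scope, src_scope=None):
    """Map each sketch to be built to the sources whose sequence files feed it."""
    src_scope = src_scope or scope  # src_scope defaults to scope
    if scope == "sample" and src_scope == "sample":
        # One sketch per sample, built from that sample's own files.
        return {sample: [sample] for sample in samples}
    if scope == "project" and src_scope == "sample":
        # One project-wide sketch built from every sample's files.
        return {"project": list(samples)}
    if scope == "project" and src_scope == "project":
        # One project-wide sketch built from the project-level files.
        return {"project": ["project"]}
    raise ValueError(f"unsupported combination: scope={scope}, src_scope={src_scope}")


if __name__ == "__main__":
    samples = ["s1", "s2"]
    print(plan_sketches(samples, "sample"))
    print(plan_sketches(samples, "project", src_scope="sample"))
    print(plan_sketches(samples, "project"))
```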

Requires:

fasta files in one of the following slots:
- sample_data[<sample>]["fasta.nucl"]
or fastq files in the following slots:
- sample_data[<sample>]["fastq.F"]
- sample_data[<sample>]["fastq.R"]
- sample_data[<sample>]["fastq.S"]
For scope = project, uses project-wide files.

Output:

puts ‘msh’ output files in the following slots for (scope=sample):
- sample_data[<sample>]["msh.fasta"]
- sample_data[<sample>]["msh.fastq"]
puts ‘msh’ output files in the following slots for (scope=project):
- sample_data["project_data"]["msh.fasta"]
- sample_data["project_data"]["msh.fastq"]

Parameters that can be set

Parameter	Values	Comments
scope	project\|sample	The scope for which to build the sketch.
src_scope	project\|sample	The scope from which to take the sequence files. Default - same as `scope`
type	fasta\|fastq	Use fastq or fasta files. By default, uses any that exist.

Lines for parameter file

Create sketch for each sample based on fastq files

sketch_smp:
    module:         mash_sketch
    base:           trim_gal
    script_path:    "{Vars.paths.mash} sketch"
    scope:          sample
    type:           fastq
    rm_merged:
    qsub_params:
        -pe:        shared 10
    redirects:
        -m:         2
        -p:         10

Create project sketch for all samples’ fastq files

sketch_proj:
    module:         mash_sketch
    base:           merge1
    script_path:    "{Vars.paths.mash} sketch"
    src_scope:      sample
    scope:          project
    type:           fastq
    rm_merged:
    redirects:
        -m:         2
        -p:         10

References

Ondov, Brian D., et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17.1 (2016): 132.

`mash_dist`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Requires:

fasta files in one of the following slots:
- sample_data[<sample>]["fasta.nucl"]
- sample_data["fasta.nucl"]
OR fastq files in one of the following slots (merge fastq files first with mash_sketch or otherwise):
- sample_data[<sample>]["fastq"]
- sample_data["fastq"]
OR sketch files in one of the following slots:
- sample_data[<sample>]["msh.fastq"]
- sample_data[<sample>]["msh.fasta"]
- sample_data["msh.fastq"]
- sample_data["msh.fasta"]

Output:

puts ‘msh’ output files in the following slots for (scope=sample):
- sample_data[<sample>]["msh.fasta"]
- sample_data[<sample>]["msh.fastq"]
puts mash dist tables in the following slots for (scope=project and scope=all_samples):
- sample_data[<sample>]["mash.dist.table"]
- sample_data["mash.dist.table"]

Parameters that can be set

Parameter	Values	Comments
reference		A block including ‘path’ or ‘scope’, ‘type’ and optionally ‘msh’
query		A block including ‘scope’ (sample, project or all_samples), ‘type’ and optionally ‘msh’

Lines for parameter file

External reference. Sample-wise fastq files.
Returns table of mash dist of sample against external reference. One table per sample

dist:
    module:         mash_dist
    base:           [sketch_proj,sketch_smp]
    script_path:    "{Vars.paths.mash} dist"
    reference:
        path:   /path/to/ref1
    query:
        scope:          sample
        type:           fastq
        msh:

Project mashed fasta reference. Sample mashed fastq query
Returns table of mash dist of sample against project reference. One table per sample

dist:
    module:         mash_dist
    base:           [sketch_proj,sketch_smp]
    script_path:    "{Vars.paths.mash} dist"
    reference:
        scope:      project
        type:       fasta
        msh:
    query:
        scope:      sample
        type:       fastq
        msh:

Project mashed reference. Project mashed fastq query
Returns table of mash dist of project sketch against project sketch. One table for the whole project.

If the project sketch is built from sample sketches, as is created by mash_sketch using scope=project and src_scope=sample, the result will be an all-against-all mash dist table.

dist:
    module:         mash_dist
    base:           [sketch_proj,sketch_smp]
    script_path:    "{Vars.paths.mash} dist"
    reference:
        scope:      project
        type:       fastq
        msh:
    query:
        scope:      project
        type:       fastq
        msh:
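An all-against-all table like the one described above is easier to work with as a matrix. The sketch below assumes the table stored in `mash.dist.table` is the unmodified output of `mash dist`, which is tab-delimited with the fields reference-ID, query-ID, distance, p-value and shared-hashes; `distance_matrix` is a hypothetical name and the rows are made up.

```python
import csv
import io
from collections import defaultdict

# Field order of `mash dist` output.
FIELDS = ["reference", "query", "distance", "p_value", "shared_hashes"]


def distance_matrix(handle):
    """Read a mash dist table into a nested dict: matrix[reference][query]."""
    matrix = defaultdict(dict)
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        matrix[row["reference"]][row["query"]] = float(row["distance"])
    return dict(matrix)


# Made-up table, for demonstration only.
SAMPLE_TABLE = (
    "s1.fq\ts1.fq\t0\t0\t1000/1000\n"
    "s1.fq\ts2.fq\t0.0222766\t0\t456/1000\n"
    "s2.fq\ts1.fq\t0.0222766\t0\t456/1000\n"
    "s2.fq\ts2.fq\t0\t0\t1000/1000\n"
)

if __name__ == "__main__":
    matrix = distance_matrix(io.StringIO(SAMPLE_TABLE))
    for reference, row in matrix.items():
        print(reference, row)
```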

Project mashed fastq reference. Sample mashed fastq query
Returns table of mash dist of project sketch against each sample sketch. One table per sample.

dist: 
    module:         mash_dist
    base:           [sketch_proj,sketch_smp]
    script_path:    "{Vars.paths.mash} dist"
    reference:
        scope:      project
        type:       fastq
        msh:
    query:
        scope:      sample
        type:       fastq
        msh:

References

Ondov, Brian D., et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17.1 (2016): 132.
