silvio.extensions.sets.shotgun_sequencing package¶

This sub-module contains methods to simulate second-generation shotgun sequencing, and tools to visualize and evaluate fragment assembly.

Submodules¶

silvio.extensions.sets.shotgun_sequencing.assembly module¶

class silvio.extensions.sets.shotgun_sequencing.assembly.ContigAssembler(name: str, seed: Optional[int] = None)[source]¶

Bases: silvio.tool.Tool, abc.ABC

Abstract class that defines the interface of all assemblers.

_abc_impl = <_abc._abc_data object>¶

apply(scaffolds: List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold]) → silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence[source]¶

apply_internal(scaffolds: List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold]) → List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence][source]¶

class silvio.extensions.sets.shotgun_sequencing.assembly.GreedyContigAssembler(name: Optional[str] = None, seed: Optional[int] = None)[source]¶

Bases: silvio.extensions.sets.shotgun_sequencing.assembly.ContigAssembler

This assembler tries to assemble the DNA scaffolds using greedy pairwise matching. It starts with the pair of sequences that has the highest pairwise match score and from there iteratively adds one of the remaining sequences by matching it against the consensus sequence of all sequences clustered so far.

Attention: It will extract both sequences of paired-end reads, but is not able to use their pairing relationship during assembly.

_abc_impl = <_abc._abc_data object>¶

apply(scaffolds: List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold]) → silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence[source]¶

apply_internal(scaffolds: List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold]) → List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence][source]¶
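The greedy strategy described for GreedyContigAssembler can be illustrated with a small sketch. This is not the library's implementation: it scores pairs by exact suffix/prefix overlap instead of pairwise alignment, it grows a single merged string instead of keeping a consensus of clustered sequences, and the names `overlap_len`, `best_join` and `greedy_assemble` are hypothetical.

```python
from typing import List, Tuple


def overlap_len(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def best_join(contig: str, read: str) -> Tuple[int, str]:
    """Best way to attach `read` to `contig`, as (score, merged string)."""
    if read in contig:
        return len(read), contig  # read is fully contained in the contig
    after = overlap_len(contig, read)   # read extends the right end
    before = overlap_len(read, contig)  # read extends the left end
    if after >= before:
        return after, contig + read[after:]
    return before, read + contig[before:]


def greedy_assemble(reads: List[str]) -> str:
    """Seed with the best-scoring pair, then add the best remaining read each round."""
    remaining = list(reads)
    if len(remaining) < 2:
        return remaining[0] if remaining else ""
    # Seed: the pair of reads with the highest pairwise score.
    pairs = [
        (best_join(remaining[i], remaining[k]), i, k)
        for i in range(len(remaining))
        for k in range(len(remaining))
        if i != k
    ]
    (_, contig), i, k = max(pairs, key=lambda p: p[0][0])
    remaining = [r for idx, r in enumerate(remaining) if idx not in (i, k)]
    # Iteratively attach whichever remaining read matches the contig best.
    while remaining:
        scored = [(best_join(contig, r), idx) for idx, r in enumerate(remaining)]
        (_, contig), idx = max(scored, key=lambda s: s[0][0])
        remaining.pop(idx)
    return contig
```

Like the assembler described here, the sketch treats every read independently and ignores any pairing relationship between reads.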

class silvio.extensions.sets.shotgun_sequencing.assembly.PairwiseScore(i, k, score)[source]¶

Bases: tuple

_asdict()¶: Return a new dict which maps field names to their values.

_field_defaults = {}¶

_fields = ('i', 'k', 'score')¶

classmethod _make(iterable)¶: Make a new PairwiseScore object from a sequence or iterable

_replace(**kwds)¶: Return a new PairwiseScore object replacing specified fields with new values

i¶: Alias for field number 0

k¶: Alias for field number 1

score¶: Alias for field number 2

class silvio.extensions.sets.shotgun_sequencing.assembly.RandomContigAssembler(expected_genome_size: int, name: Optional[str] = None, seed: Optional[int] = None)[source]¶

Bases: silvio.extensions.sets.shotgun_sequencing.assembly.ContigAssembler

The random assembler will place all contigs in the scaffolds in random locations. Good assemblers should aspire to at least be better than this random assembler.

_abc_impl = <_abc._abc_data object>¶

apply(scaffolds: List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold]) → silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence[source]¶

scaffolds: Scaffolds that were returned from fragmentation.

Returns: An estimation of bases for each position, based on random placement.

apply_internal(scaffolds: List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold]) → List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence][source]¶

silvio.extensions.sets.shotgun_sequencing.assembly.match(sa: Bio.Seq.Seq, sb: Bio.Seq.Seq) → Bio.Blast.Record.Alignment[source]¶: Internal method for pairwise matching.

silvio.extensions.sets.shotgun_sequencing.datatype module¶

class silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence(base_props: Dict[Literal[A, C, G, T], List[float]])[source]¶

Bases: object

An EstimatedSequence can hold a base probability for each of the positions in the gene sequence. Assemblers that split their certainty of a base call between two or more different bases can express that with these probabilities. Deterministic sequences can also be converted to estimated sequences where each corresponding base call is rated with a full probability.

as_consensus_sequence() → Bio.Seq.Seq[source]¶: Outputs the consensus sequence by choosing the base with highest probability for each location.

calc_shannon_entropy() → float[source]¶: Calculate the Shannon entropy of this estimated sequence. The entropy is computed for each position and the average over all positions is returned.
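The two operations on an estimated sequence can be sketched on a plain dictionary of per-base probability lists. This is only an illustration under stated assumptions: the functions `consensus` and `mean_shannon_entropy` are hypothetical names, the logarithm base (2, giving bits) is an assumption, and ties in the consensus are resolved in the order A, C, G, T.

```python
import math
from typing import Dict, List

BASES = ("A", "C", "G", "T")


def consensus(base_probs: Dict[str, List[float]]) -> str:
    """Pick the most probable base at every position."""
    length = len(base_probs["A"])
    return "".join(
        max(BASES, key=lambda base: base_probs[base][pos]) for pos in range(length)
    )


def mean_shannon_entropy(base_probs: Dict[str, List[float]]) -> float:
    """Average per-position Shannon entropy in bits (log base 2 is an assumption)."""
    length = len(base_probs["A"])
    if length == 0:
        return 0.0
    total = 0.0
    for pos in range(length):
        for base in BASES:
            p = base_probs[base][pos]
            if p > 0:  # the term 0 * log(0) is taken as 0
                total -= p * math.log2(p)
    return total / length
```

A position with a single certain base contributes 0 bits, and a position split evenly between all four bases contributes 2 bits, so lower values indicate a more confident assembly.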

class silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence[source]¶

Bases: tuple

A sequence with a starting position.

_asdict()¶: Return a new dict which maps field names to their values.

_field_defaults = {}¶

_fields = ('sequence', 'locus')¶

classmethod _make(iterable)¶: Make a new LocalizedSequence object from a sequence or iterable

_replace(**kwds)¶: Return a new LocalizedSequence object replacing specified fields with new values

locus¶: Alias for field number 1

sequence¶: Alias for field number 0

class silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold[source]¶

Bases: tuple

A sequence of bases, with added information about read quality and relative distances. Read quality is expressed as a ratio between 0 (worst) and 1 (best). A scaffold holds either a single contig or two contigs with paired ends. The R2 contig has a reversed sequence. The expected_len field includes both contigs and the gap.

| R1 contig     |            gap            |     R2 contig |

[0123=========] - - - - - - - - - - - - - [=========3210]

--> order                                       order <--

_asdict()¶: Return a new dict which maps field names to their values.

_field_defaults = {}¶

_fields = ('expected_len', 'r1_seqrecord', 'r2_seqrecord')¶

classmethod _make(iterable)¶: Make a new Scaffold object from a sequence or iterable

_replace(**kwds)¶: Return a new Scaffold object replacing specified fields with new values

expected_len¶: Alias for field number 0

r1_seqrecord¶: Alias for field number 1

r2_seqrecord¶: Alias for field number 2
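The scaffold layout described above (R1 contig, gap, reversed R2 contig) can be made concrete with a small sketch. It uses plain strings instead of SeqRecord objects, the names `ToyScaffold`, `gap_length` and `layout` are hypothetical, and it assumes that "reversed" means a plain reversal of the stored R2 string rather than a reverse complement.

```python
from typing import NamedTuple, Optional


class ToyScaffold(NamedTuple):
    """Simplified stand-in for Scaffold, holding plain strings."""
    expected_len: int
    r1: str
    r2: Optional[str] = None


def gap_length(scaffold: ToyScaffold) -> int:
    """Unsequenced bases between R1 and R2; negative if the reads overlap."""
    if scaffold.r2 is None:
        return 0
    return scaffold.expected_len - len(scaffold.r1) - len(scaffold.r2)


def layout(scaffold: ToyScaffold) -> str:
    """Render the scaffold left to right, with '-' marking the gap."""
    if scaffold.r2 is None:
        return scaffold.r1
    gap = gap_length(scaffold)
    if gap < 0:
        raise ValueError("reads overlap; this sketch does not merge them")
    # R2 is stored reversed, so reverse it again to restore left-to-right order.
    return scaffold.r1 + "-" * gap + scaffold.r2[::-1]
```

For example, `layout(ToyScaffold(12, "ACGT", "TTGA"))` places four gap characters between the R1 read and the re-reversed R2 read.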

silvio.extensions.sets.shotgun_sequencing.datatype.estimate_from_overlap(locseqs: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence]) → silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence[source]¶: Given a list of localized sequences, overlap them and compute the probability of each base at every position from the bases that cover it. Similar to the get_consensus_from_overlap method, but the probabilities are kept instead of only the best base.
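One way to picture this overlap step is to count the bases that land on each position and normalize the counts. The sketch below is an assumption-laden illustration, not the library's code: localized sequences are plain `(sequence, locus)` tuples, every read has equal weight (read quality is ignored), and the function name `estimate_from_overlap_sketch` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

BASES = ("A", "C", "G", "T")


def estimate_from_overlap_sketch(
    locseqs: List[Tuple[str, int]]
) -> Dict[str, List[float]]:
    """Per-position base frequencies over all overlapping (sequence, locus) pairs."""
    if not locseqs:
        return {base: [] for base in BASES}
    start = min(locus for _, locus in locseqs)
    end = max(locus + len(seq) for seq, locus in locseqs)
    # One counter per covered position, indexed relative to the smallest locus.
    counts = [Counter() for _ in range(end - start)]
    for seq, locus in locseqs:
        for offset, base in enumerate(seq):
            counts[locus - start + offset][base] += 1
    probs: Dict[str, List[float]] = {base: [] for base in BASES}
    for counter in counts:
        total = sum(counter[base] for base in BASES)
        for base in BASES:
            # Positions that no read covers get probability 0 for every base.
            probs[base].append(counter[base] / total if total else 0.0)
    return probs
```

Taking the most probable base at each position of the result would give the consensus, which is the relationship to get_consensus_from_overlap that the description mentions.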

silvio.extensions.sets.shotgun_sequencing.datatype.get_consensus_from_overlap(locseqs: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence]) → silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence[source]¶: Output the consensus sequence of multiple localized sequences. Returns: ( min starting locus, max ending locus, consensus sequence )

silvio.extensions.sets.shotgun_sequencing.evaluation module¶

silvio.extensions.sets.shotgun_sequencing.evaluation.calc_sequence_score(locseqs: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence], genome: Bio.Seq.Seq, genome_start: int) → List[float][source]¶: Calculate the matching score of each individual localized sequence.

silvio.extensions.sets.shotgun_sequencing.evaluation.calc_total_score(estseq: silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence, genome: Bio.Seq.Seq) → Tuple[float, int][source]¶: Shift the genome around the heatmap and get the best overall match. Score is a simple addition of all ratios of the correct base, normalized as percentage.
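The shifting idea can be sketched as follows. This is an illustration under assumptions rather than the library's algorithm: the estimate is a plain dictionary of per-base probability lists, every shift with at least one overlapping position is tried, the score is normalized by the genome length, and the name `best_shift_score` is hypothetical.

```python
from typing import Dict, List, Tuple


def best_shift_score(
    base_probs: Dict[str, List[float]], genome: str
) -> Tuple[float, int]:
    """Return (best score in percent, shift of the genome start that achieves it)."""
    est_len = len(base_probs["A"])
    best = (0.0, 0)
    if not genome:
        return best
    for shift in range(-len(genome) + 1, est_len):
        total = 0.0
        for i, base in enumerate(genome):
            pos = i + shift
            if 0 <= pos < est_len and base in base_probs:
                # Add the probability the estimate gives to the correct base.
                total += base_probs[base][pos]
        score = 100.0 * total / len(genome)
        if score > best[0]:
            best = (score, shift)
    return best
```

With a deterministic estimate that contains the genome exactly, the best score is 100 and the returned shift is the position where the genome starts inside the estimate.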

silvio.extensions.sets.shotgun_sequencing.evaluation.evaluate_sequence(estimated_sequence: silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence, genome: Bio.Seq.Seq) → float[source]¶: Compares an estimated sequence with the real genome and returns the true positive rate. Return value is a number between 0 (worst score) and 1 (best score).

silvio.extensions.sets.shotgun_sequencing.sequencing module¶

class silvio.extensions.sets.shotgun_sequencing.sequencing.RatedSequence(name: str, seq: Optional[List[Literal[A, C, G, T]]], qual: Optional[List[float]])[source]¶

Bases: object

Internal structure to store sequences and their quality score.

class silvio.extensions.sets.shotgun_sequencing.sequencing.ShotgunSequencer(library_size_mean: float = 400, library_size_sd: float = 75, read_method: Literal[single-read, paired-end] = 'single-read', read_length: int = 150, average_coverage: float = 10, call_error_beta: float = 2.85, name: Optional[str] = None, seed: Optional[int] = None)[source]¶

Bases: silvio.tool.Tool

Tool that, after setup, takes in a genome and performs whole-genome sequencing on it. The result simulates the output of a sequencing machine: a set of fragmented genome libraries that can later be assembled to reconstruct the real genome of the host.

Sequencing machines can be highly customizable and allow a multitude of parameters. Some considerations about the type of parameters and their default values are given below:

Library size selection is usually performed, and this simulator reproduces effects similar to those described under “Double-sided size selection and bead clean-up”. As a simplification, the blue graph in the link is approximated by a normal distribution. https://emea.support.illumina.com/bulletins/2020/07/library-size-selection-using-sample-purification-beads.html

The read length of each fragment can also be configured, but a limit at roughly 150bp is set for each library, with some versions achieving up to 300bp. This limit is for each read. For example, with a value of 150bp, running the simulation on single-read method will yield libraries of 150bp each, whereas a paired-end method will yield two linked sequences of 150bp each.

Coverage of the extracted libraries can be controlled by reagents and number of cycles in the cloning step. The distribution of covered regions, however, depends on how the fragments attach to the flow cell prior to cluster forming. Attachment to the flow cell follows random sampling from the available fragments. Given the number of fragments, this can be approximated as sampling with replacement. The average coverage and read length are configurable, the genome size is known beforehand, and the number of reads is calculated from these values using the Lander/Waterman equation. https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf

Quality of each call is difficult to simulate as it depends on multiple factors, such as luminosity of dNTPs, signal-to-noise ratio, hardware of the sequencer and chemistry used. It is defined by Illumina through empirical observations. For the simulation, an exponential distribution is used such that, under default values, a Q30 score is achieved 97% of the time. https://www.illumina.com/Documents/products/technotes/technote_Q-Scores.pdf

The sequences that are output will only contain the target region of each library. Adapter, index and tag regions are discarded.
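The size selection and read extraction described above can be sketched roughly as follows. This is a simplified illustration, not the simulator's code: fragment sizes are drawn from a normal distribution and clipped to the genome, start positions are drawn independently (which corresponds to sampling with replacement), base-call errors and quality scores are left out, R2 is taken as the plain reversal of the fragment's far end, and the function name `sample_reads` is hypothetical.

```python
import random
from typing import List, Optional, Tuple


def sample_reads(
    genome: str,
    num_fragments: int,
    size_mean: float = 400,
    size_sd: float = 75,
    read_length: int = 150,
    paired_end: bool = False,
    seed: Optional[int] = None,
) -> List[Tuple[int, str, Optional[str]]]:
    """Return (fragment size, R1 read, R2 read or None) for each sampled fragment."""
    if not genome:
        raise ValueError("genome must not be empty")
    rng = random.Random(seed)
    scaffolds = []
    for _ in range(num_fragments):
        # Library size selection, simplified to a clipped normal distribution.
        size = int(round(rng.gauss(size_mean, size_sd)))
        size = max(1, min(size, len(genome)))
        # Each start is drawn independently of all previous draws.
        start = rng.randint(0, len(genome) - size)
        fragment = genome[start:start + size]
        r1 = fragment[:read_length]
        # R2 is read from the other end of the fragment and stored reversed.
        r2 = fragment[-read_length:][::-1] if paired_end else None
        scaffolds.append((size, r1, r2))
    return scaffolds
```

With `paired_end=False` each entry holds a single read of at most `read_length` bases; with `paired_end=True` it holds two linked reads from opposite ends of the same fragment.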

apply(genome: Bio.Seq.Seq) → List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold][source]¶

calc_num_scaffolds_obtained(genome: Bio.Seq.Seq) → int[source]¶: Number of scaffolds that will be obtained, calculated with the Lander/Waterman equation from the current parameters and the host genome. This accounts for the read method used: paired-end methods output half as many scaffolds.
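The Lander/Waterman relation is C = L * N / G, where C is the coverage, L the read length, N the number of reads and G the genome length, so N = C * G / L. A minimal sketch of the calculation follows; the function name `num_scaffolds` and the rounding up to a whole scaffold are assumptions, not taken from the library.

```python
import math


def num_scaffolds(
    genome_size: int,
    read_length: int = 150,
    average_coverage: float = 10.0,
    paired_end: bool = False,
) -> int:
    """Scaffolds needed to reach the requested average coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # Lander/Waterman: coverage = read_length * num_reads / genome_size.
    num_reads = average_coverage * genome_size / read_length
    # A paired-end scaffold carries two reads, so half as many are needed.
    reads_per_scaffold = 2 if paired_end else 1
    return math.ceil(num_reads / reads_per_scaffold)
```

For a 3000 bp genome with the default read length of 150 and coverage of 10, this gives 200 single-read scaffolds or 100 paired-end scaffolds.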

silvio.extensions.sets.shotgun_sequencing.sequencing.convert_to_seqrecord(rated_seq: silvio.extensions.sets.shotgun_sequencing.sequencing.RatedSequence) → Bio.SeqRecord.SeqRecord[source]¶: Convert an internal RatedSequence to the SeqRecord that is commonly used throughout the library.

silvio.extensions.sets.shotgun_sequencing.sequencing.phred_to_prob(phred_score: float) → float[source]¶: Convert a Phred score into a probability.
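By the standard Phred definition, a quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows both the error probability and its complement, because the description does not state which of the two phred_to_prob returns; the function names are hypothetical.

```python
def phred_to_error_prob(phred_score: float) -> float:
    """Probability that the base call is wrong, per the standard Phred definition."""
    return 10 ** (-phred_score / 10)


def phred_to_accuracy(phred_score: float) -> float:
    """Probability that the base call is correct (0 = worst, 1 = best)."""
    return 1 - phred_to_error_prob(phred_score)
```

Under this definition a Q30 call has an error probability of 0.001, which is the threshold referred to in the quality discussion of ShotgunSequencer.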

silvio.extensions.sets.shotgun_sequencing.storage module¶

silvio.extensions.sets.shotgun_sequencing.storage.write_scaffolds_to_file(scaffolds: List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold], r1_path: str, r2_path: Optional[str] = None)[source]¶: Write a list of scaffolds into the filesystem as a FASTQ file. This method will detect paired-end sequences and store them in two different files in their reversed form.
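For reference, a FASTQ record consists of four lines: a header starting with `@`, the sequence, a `+` separator, and one quality character per base. The sketch below formats a single record with the common Phred+33 encoding using only the standard library. It is an illustration rather than what write_scaffolds_to_file does internally: the function name `format_fastq_record` is hypothetical, and it takes integer Phred scores as input instead of the 0 to 1 quality ratios stored in a scaffold.

```python
from typing import Sequence


def format_fastq_record(name: str, sequence: str, phred_scores: Sequence[int]) -> str:
    """Format one FASTQ record using Phred+33 quality characters."""
    if len(sequence) != len(phred_scores):
        raise ValueError("one quality score is required per base")
    # Clamp scores to the printable Phred+33 range (0 to 93) before encoding.
    quality = "".join(chr(min(max(int(q), 0), 93) + 33) for q in phred_scores)
    return f"@{name}\n{sequence}\n+\n{quality}\n"
```

For paired-end data, the R1 and R2 records of each scaffold would be written in the same order to two separate files, matching the `r1_path` and `r2_path` parameters.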

silvio.extensions.sets.shotgun_sequencing.visualization module¶

This file contains helper methods to visualize data structures and aid in debugging.

silvio.extensions.sets.shotgun_sequencing.visualization.numtochar(num: int) → str[source]¶: Convert a number to a visual character.

silvio.extensions.sets.shotgun_sequencing.visualization.print_assembly_evaluation(loc_sequences: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence], genome: Bio.Seq.Seq) → None[source]¶: Given a list of localized sequences, print a visual representation of the evaluation.

silvio.extensions.sets.shotgun_sequencing.visualization.print_estimation_evaluation(est_sequence: silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence, genome: Bio.Seq.Seq) → None[source]¶: Given only an estimated sequence, print a visual representation of the evaluation.

silvio.extensions.sets.shotgun_sequencing.visualization.print_evaluation_alignment(score: float, genome_start: int, genome: Bio.Seq.Seq, estseq: silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence, locseqs: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence] = None) → None[source]¶: Print a visual representation of how alignment was evaluated.

silvio.extensions.sets.shotgun_sequencing.visualization.print_scaffold(scaffold: silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold) → None[source]¶: Visually print a scaffold as text.

silvio.extensions.sets.shotgun_sequencing.visualization.print_scaffold_as_fastq(scaffold: silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold) → None[source]¶: Print the scaffold content as FASTQ.