silvio.extensions.sets.shotgun_sequencing package
This sub-module contains methods to simulate second-generation shotgun sequencing, and tools to visualize and evaluate fragment assembly.
Submodules
silvio.extensions.sets.shotgun_sequencing.assembly module
-
class silvio.extensions.sets.shotgun_sequencing.assembly.ContigAssembler(name: str, seed: Optional[int] = None)
Bases: silvio.tool.Tool, abc.ABC
Abstract class that defines the interface of all assemblers.
-
_abc_impl= <_abc._abc_data object>¶
-
-
class silvio.extensions.sets.shotgun_sequencing.assembly.GreedyContigAssembler(name: Optional[str] = None, seed: Optional[int] = None)
Bases: silvio.extensions.sets.shotgun_sequencing.assembly.ContigAssembler
This assembler tries to assemble the DNA scaffolds using greedy pairwise matching. It starts with the pair of sequences that has the highest pairwise matching score and from there on iteratively adds one of the remaining sequences by matching it against the consensus sequence of all already clustered sequences.
Attention: It will extract both sequences of paired-end reads, but is not able to use their pairing relationship during assembly.
-
_abc_impl= <_abc._abc_data object>¶
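The greedy strategy described for GreedyContigAssembler can be illustrated with a minimal sketch. It is not the silvio implementation: it assumes error-free reads, scores a pair by its exact suffix/prefix overlap, and uses the growing contig itself in place of a consensus sequence. The helper names overlap_len, best_pair and greedy_assemble are hypothetical.

```python
from typing import List, Tuple


def overlap_len(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def best_pair(seqs: List[str]) -> Tuple[int, int, int]:
    """Return (i, k, score) for the ordered pair with the largest overlap."""
    best = (-1, -1, 0)
    for i, left in enumerate(seqs):
        for k, right in enumerate(seqs):
            if i != k:
                score = overlap_len(left, right)
                if score > best[2]:
                    best = (i, k, score)
    return best


def greedy_assemble(reads: List[str]) -> str:
    """Seed with the best pair, then keep attaching the best-matching read."""
    remaining = list(reads)
    i, k, score = best_pair(remaining)
    if score == 0:
        # No overlapping pair at all: nothing to assemble.
        return remaining[0] if remaining else ""
    contig = remaining[i] + remaining[k][score:]
    remaining = [s for idx, s in enumerate(remaining) if idx not in (i, k)]
    while remaining:
        # Score every remaining read against both ends of the contig.
        candidates = []
        for idx, read in enumerate(remaining):
            candidates.append((overlap_len(contig, read), idx, "right"))
            candidates.append((overlap_len(read, contig), idx, "left"))
        score, idx, side = max(candidates)
        if score == 0:
            break  # the remaining reads do not overlap the contig
        read = remaining.pop(idx)
        contig = contig + read[score:] if side == "right" else read + contig[score:]
    return contig


assembled = greedy_assemble(["ACCTGA", "ATGCGT", "CGTACC"])
```

As in the documented assembler, this sketch treats every read independently and makes no use of paired-end relationships.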
-
-
class silvio.extensions.sets.shotgun_sequencing.assembly.PairwiseScore(i, k, score)
Bases: tuple
-
_asdict()¶ Return a new dict which maps field names to their values.
-
_field_defaults= {}¶
-
_fields= ('i', 'k', 'score')¶
-
classmethod
_make(iterable)¶ Make a new PairwiseScore object from a sequence or iterable
-
_replace(**kwds)¶ Return a new PairwiseScore object replacing specified fields with new values
-
i¶ Alias for field number 0
-
k¶ Alias for field number 1
-
score¶ Alias for field number 2
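The members listed above are the standard named-tuple helpers. The sketch below uses a stand-in built with collections.namedtuple that has the same field names; it is not imported from silvio. Reading i and k as the indices of the two compared sequences is an assumption.

```python
from collections import namedtuple

# Stand-in with the documented field names; not the silvio class itself.
PairwiseScore = namedtuple("PairwiseScore", ["i", "k", "score"])

# Assumption: i and k index the two sequences that were compared.
scores = [
    PairwiseScore(0, 1, 12.0),
    PairwiseScore(0, 2, 30.5),
    PairwiseScore(1, 2, 7.0),
]

# Pick the highest-scoring pair, as a greedy assembler would for its seed.
best = max(scores, key=lambda s: s.score)

as_dict = best._asdict()             # field name -> value mapping
rescored = best._replace(score=0.0)  # new tuple, original is unchanged
rebuilt = PairwiseScore._make([3, 4, 1.5])
```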
-
-
class silvio.extensions.sets.shotgun_sequencing.assembly.RandomContigAssembler(expected_genome_size: int, name: Optional[str] = None, seed: Optional[int] = None)
Bases: silvio.extensions.sets.shotgun_sequencing.assembly.ContigAssembler
The random assembler will place all contigs in the scaffolds in random locations. Good assemblers should aspire to at least be better than this random assembler.
-
_abc_impl= <_abc._abc_data object>¶
-
silvio.extensions.sets.shotgun_sequencing.datatype module
-
class silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence(base_props: Dict[Literal[A, C, G, T], List[float]])
Bases: object
An EstimatedSequence can hold a base probability for each of the positions in the gene sequence. Assemblers that split their certainty of a base call between two or more different bases can express that with these probabilities. Deterministic sequences can also be converted to estimated sequences where each corresponding base call is rated with a full probability.
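The per-base probability layout can be sketched with a plain dictionary shaped like the constructor argument (one list of probabilities per base). The helpers probs_from_sequence and most_likely_sequence are hypothetical and only illustrate the conversion of a deterministic sequence; they are not part of silvio.

```python
from typing import Dict, List

BASES = ("A", "C", "G", "T")


def probs_from_sequence(sequence: str) -> Dict[str, List[float]]:
    """Give the called base probability 1.0 and every other base 0.0."""
    return {
        base: [1.0 if call == base else 0.0 for call in sequence]
        for base in BASES
    }


def most_likely_sequence(base_probs: Dict[str, List[float]]) -> str:
    """Pick the base with the highest probability at each position."""
    length = len(base_probs["A"])
    return "".join(
        max(BASES, key=lambda base: base_probs[base][pos])
        for pos in range(length)
    )


deterministic = probs_from_sequence("ACGT")

# An assembler that is unsure about position 1 can split its certainty.
uncertain = {
    "A": [1.0, 0.0],
    "C": [0.0, 0.6],
    "G": [0.0, 0.4],
    "T": [0.0, 0.0],
}
```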
-
class silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence
Bases: tuple
A sequence with a starting position.
-
_asdict()¶ Return a new dict which maps field names to their values.
-
_field_defaults= {}¶
-
_fields= ('sequence', 'locus')¶
-
classmethod
_make(iterable)¶ Make a new LocalizedSequence object from a sequence or iterable
-
_replace(**kwds)¶ Return a new LocalizedSequence object replacing specified fields with new values
-
locus¶ Alias for field number 1
-
sequence¶ Alias for field number 0
-
-
class silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold
Bases: tuple
A sequence of bases, with added information about read quality and relative distances. Read quality is expressed as a ratio between 0 (worst) and 1 (best). A scaffold can have either a single contig or 2 contigs with paired ends. The R2 contig has a reversed sequence. The expected_len includes both contigs and the gap.
|  R1 contig  |            gap            |  R2 contig  |
[0123=========] - - - - - - - - - - - - - [=========3210]
 --> order                                     order <--
-
_asdict()¶ Return a new dict which maps field names to their values.
-
_field_defaults= {}¶
-
_fields= ('expected_len', 'r1_seqrecord', 'r2_seqrecord')¶
-
classmethod
_make(iterable)¶ Make a new Scaffold object from a sequence or iterable
-
_replace(**kwds)¶ Return a new Scaffold object replacing specified fields with new values
-
expected_len¶ Alias for field number 0
-
r1_seqrecord¶ Alias for field number 1
-
r2_seqrecord¶ Alias for field number 2
-
silvio.extensions.sets.shotgun_sequencing.datatype.estimate_from_overlap(locseqs: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence]) → silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence
Given a list of localized sequences, overlap them and compute the probability of each base at every position of the overlap. Similar to the get_consensus_from_overlap method, but the probabilities are kept instead of only the best base.
-
silvio.extensions.sets.shotgun_sequencing.datatype.get_consensus_from_overlap(locseqs: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence]) → silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence
Output the consensus sequence of multiple localized sequences. Returns: ( min starting locus, max ending locus, consensus sequence )
silvio.extensions.sets.shotgun_sequencing.evaluation module
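The difference between keeping the probabilities and keeping only the best base can be shown with a small sketch. It uses a stand-in named tuple with the documented fields (sequence, locus) and plain per-position counting; the function names, the use of "N" for uncovered positions, and the tie-breaking are assumptions, not the silvio behaviour.

```python
from collections import Counter, namedtuple
from typing import Dict, List, Tuple

# Stand-in mirroring the documented fields; not imported from silvio.
LocalizedSequence = namedtuple("LocalizedSequence", ["sequence", "locus"])
BASES = ("A", "C", "G", "T")


def count_overlap(locseqs: List[LocalizedSequence]) -> Tuple[int, List[Counter]]:
    """Stack the sequences by locus and count the bases seen in each column."""
    start = min(ls.locus for ls in locseqs)
    end = max(ls.locus + len(ls.sequence) for ls in locseqs)
    counts = [Counter() for _ in range(end - start)]
    for ls in locseqs:
        for offset, base in enumerate(ls.sequence):
            counts[ls.locus - start + offset][base] += 1
    return start, counts


def estimate_probs(locseqs: List[LocalizedSequence]) -> Dict[str, List[float]]:
    """Keep the share of each base per column (0.0 where nothing covers it)."""
    _, counts = count_overlap(locseqs)
    probs: Dict[str, List[float]] = {base: [] for base in BASES}
    for column in counts:
        total = sum(column.values())
        for base in BASES:
            probs[base].append(column[base] / total if total else 0.0)
    return probs


def consensus(locseqs: List[LocalizedSequence]) -> LocalizedSequence:
    """Keep only the most frequent base per column ('N' if uncovered)."""
    start, counts = count_overlap(locseqs)
    seq = "".join(col.most_common(1)[0][0] if col else "N" for col in counts)
    return LocalizedSequence(seq, start)


reads = [
    LocalizedSequence("ACGT", 0),
    LocalizedSequence("GTTA", 2),
    LocalizedSequence("GATA", 2),
]
probs = estimate_probs(reads)
best = consensus(reads)
```

At position 3 two reads call T and one calls A, so the estimate keeps T at 2/3 and A at 1/3 while the consensus keeps only T.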
-
silvio.extensions.sets.shotgun_sequencing.evaluation.calc_sequence_score(locseqs: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence], genome: Bio.Seq.Seq, genome_start: int) → List[float]
Calculate the matching score of each singular localized sequence.
-
silvio.extensions.sets.shotgun_sequencing.evaluation.calc_total_score(estseq: silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence, genome: Bio.Seq.Seq) → Tuple[float, int]
Shift the genome around the heatmap and get the best overall match. Score is a simple addition of all ratios of the correct base, normalized as percentage.
-
silvio.extensions.sets.shotgun_sequencing.evaluation.evaluate_sequence(estimated_sequence: silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence, genome: Bio.Seq.Seq) → float
Compares an estimated sequence with the real genome and returns the true positive rate. Return value is a number between 0 (worst score) and 1 (best score).
silvio.extensions.sets.shotgun_sequencing.sequencing module
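The shift-and-score idea described for calc_total_score can be sketched as follows. This is an illustration only: the probability table is a plain dictionary, the genome is a plain string of A, C, G and T, and normalising by the genome length to obtain a value between 0 and 1 is an assumption. The names score_at and best_shift are hypothetical.

```python
from typing import Dict, List, Tuple


def score_at(base_probs: Dict[str, List[float]], genome: str, shift: int) -> float:
    """Sum the probability given to the true base when genome[0] is placed
    at estimate position `shift`, divided by the genome length."""
    length = len(base_probs["A"])
    total = 0.0
    for g_pos, base in enumerate(genome):
        e_pos = g_pos + shift
        if 0 <= e_pos < length:
            total += base_probs[base][e_pos]
    return total / len(genome)


def best_shift(base_probs: Dict[str, List[float]], genome: str) -> Tuple[float, int]:
    """Try every shift with at least one overlapping position; keep the best."""
    length = len(base_probs["A"])
    shifts = range(-len(genome) + 1, length)
    return max((score_at(base_probs, genome, s), s) for s in shifts)


# Deterministic estimate of "TTACGT": the genome "ACGT" fits at shift 2.
estimate = {b: [1.0 if c == b else 0.0 for c in "TTACGT"] for b in "ACGT"}
score, shift = best_shift(estimate, "ACGT")
```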
-
class silvio.extensions.sets.shotgun_sequencing.sequencing.RatedSequence(name: str, seq: Optional[List[Literal[A, C, G, T]]], qual: Optional[List[float]])
Bases: object
Internal structure to store sequences and their quality score.
-
class silvio.extensions.sets.shotgun_sequencing.sequencing.ShotgunSequencer(library_size_mean: float = 400, library_size_sd: float = 75, read_method: Literal[single-read, paired-end] = 'single-read', read_length: int = 150, average_coverage: float = 10, call_error_beta: float = 2.85, name: Optional[str] = None, seed: Optional[int] = None)
Bases: silvio.tool.Tool
Function that, after setup, can take in a genome and perform whole-genome sequencing on it. The result will simulate a sequencing machine and provide a set of fragmented genome libraries that can later be assembled together to reconstruct the real genome of the host.
Sequencing machines can be highly customizable and allow a multitude of parameters. Some considerations about the type of parameters and their default values are given below:
- Usually a library size selection is done, and this simulator produces effects similar to those described under “Double-sided size selection and bead clean-up”. As a simplification, the blue graph in the link is approximated by a normal distribution. https://emea.support.illumina.com/bulletins/2020/07/library-size-selection-using-sample-purification-beads.html
- The read length of each fragment can also be configured, but a limit at roughly 150bp is set for each library, with some versions achieving up to 300bp. This limit is for each read. For example, with a value of 150bp, running the simulation on single-read method will yield libraries of 150bp each, whereas a paired-end method will yield two linked sequences of 150bp each.
- Coverage of the extracted libraries can be controlled by reagents and number of cycles in the cloning step. The distribution of covered regions, however, depends on how the fragments attach to the flow cell prior to cluster forming. Attachment to the flow cell will follow random sampling from the available fragments. Given the number of fragments, this can be simplified to sampling with replacement. The average coverage and read length are configured, the genome size is known beforehand, and the number of reads is calculated from these values using the Lander/Waterman equation. https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
- Quality of each call is difficult to simulate as it depends on multiple factors, such as: luminosity of dNTPs, signal-to-noise ratio, hardware of the sequencer and chemistry used. It is defined by Illumina through empirical observations. For the simulation, an exponential distribution is used such that, under default values, a Q30 score is achieved 97% of the time. https://www.illumina.com/Documents/products/technotes/technote_Q-Scores.pdf
The sequences that are output will only contain the target region of each library. Adapter, index and tag regions are discarded.
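Two of the relations mentioned above can be written down directly. The Lander/Waterman equation relates coverage C, read length L, number of reads N and genome size G as C = L * N / G, and a Phred quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below is independent of silvio: the function names are hypothetical, and rounding the read count up is an assumption. How the simulator counts reads in paired-end mode is not stated here.

```python
import math


def number_of_reads(genome_size: int, read_length: int = 150,
                    average_coverage: float = 10) -> int:
    """Lander/Waterman, C = L * N / G, solved for N and rounded up."""
    return math.ceil(average_coverage * genome_size / read_length)


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with quality score Q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Quality score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)


# With the documented defaults (150 bp reads, 10x coverage), a 3,000 bp
# genome needs 10 * 3000 / 150 = 200 reads.
reads_small = number_of_reads(3_000)

# Q30 means one wrong call in a thousand.
q30_error = phred_to_error_probability(30)
```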
silvio.extensions.sets.shotgun_sequencing.storage module
-
silvio.extensions.sets.shotgun_sequencing.storage.write_scaffolds_to_file(scaffolds: List[silvio.extensions.sets.shotgun_sequencing.datatype.Scaffold], r1_path: str, r2_path: Optional[str] = None)
Write a list of scaffolds into the filesystem as a FASTQ file. This method will detect paired-end sequences and store them in two different files in their reversed form.
silvio.extensions.sets.shotgun_sequencing.visualization module
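For orientation, a FASTQ record consists of four lines: a header starting with @, the bases, a + separator, and one quality character per base. The sketch below builds such records by hand with the common Phred+33 encoding. It does not reproduce write_scaffolds_to_file: interpreting the 0 to 1 quality ratio as the probability of a correct call, the cap at Q40, and all function names are assumptions.

```python
import math
from typing import Iterable, List


def quality_to_phred33(correct_prob: float, max_q: int = 40) -> str:
    """Encode the probability that a call is correct as one Phred+33 character."""
    # Clamp the error so that a perfect call maps to max_q instead of infinity.
    error = max(1.0 - correct_prob, 10 ** (-max_q / 10))
    q = min(max_q, round(-10 * math.log10(error)))
    return chr(q + 33)


def fastq_record(name: str, bases: str, qualities: List[float]) -> str:
    """Format one read as a four-line FASTQ record."""
    quality_line = "".join(quality_to_phred33(q) for q in qualities)
    return f"@{name}\n{bases}\n+\n{quality_line}\n"


def write_fastq(path: str, records: Iterable[str]) -> None:
    """Write already formatted records to a single file."""
    with open(path, "w", encoding="ascii") as handle:
        handle.writelines(records)


record = fastq_record("read1", "ACG", [1.0, 0.999, 0.0])
```

For paired-end data, one such file would be written for the R1 reads and a second one for the R2 reads, with matching record order.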
This file contains helper methods to visualize data structures and aid in debugging.
-
silvio.extensions.sets.shotgun_sequencing.visualization.numtochar(num: int) → str
Convert a number to a visual character.
-
silvio.extensions.sets.shotgun_sequencing.visualization.print_assembly_evaluation(loc_sequences: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence], genome: Bio.Seq.Seq) → None
Given a list of localized sequences, print a visual representation of the evaluation.
-
silvio.extensions.sets.shotgun_sequencing.visualization.print_estimation_evaluation(est_sequence: silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence, genome: Bio.Seq.Seq) → None
Given only an estimated sequence, print a visual representation of the evaluation.
-
silvio.extensions.sets.shotgun_sequencing.visualization.print_evaluation_alignment(score: float, genome_start: int, genome: Bio.Seq.Seq, estseq: silvio.extensions.sets.shotgun_sequencing.datatype.EstimatedSequence, locseqs: List[silvio.extensions.sets.shotgun_sequencing.datatype.LocalizedSequence] = None) → None
Print a visual representation of how alignment was evaluated.