Analysis

analysis

This module provides functions for analyzing exclusive k-mers obtained using the maximum entropy principle.

Contents

mutations_analysis: Perform k-mers analysis and optionally generate a report.
variants_analysis: Perform variants analysis based on intersection selection.

Todo

Implement tests.

`message = Messages()` `module-attribute`

Module-level `Messages` instance used for logging.

`mutations_analysis(seq_path, ref_path, seq_kmers_exclusive, kmers_positions, word, step, snps_max, annotation_dataframe, sequence_interval, mode='snps', create_report=False, chunk_size=100)`

Perform k-mers analysis and optionally generate a report.

This function performs k-mers analysis on the provided sequence data, exclusive k-mers, and annotations. It calculates exclusive adjacencies, checks differences, and returns results in a tuple. If 'create_report' is set to True, a report is generated.

Parameters:

Name	Type	Description	Default
`seq_path`	`str`	The path to the file containing the sequences in FASTA format.	required
`ref_path`	`str`	The path to the reference sequence data file.	required
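`seq_kmers_exclusive`	`list[str]`	A list of exclusive k-mers.	required
`kmers_positions`	`defaultdict[str, list[int]]`	Positions associated with the k-mers, as a mapping from string keys to lists of integer positions.	required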
`word`	`int`	The length of each k-mer.	required
`step`	`int`	The step size for moving the sliding window.	required
`snps_max`	`int`	The maximum number of SNPs allowed.	required
`annotation_dataframe`	`DataFrame`	DataFrame containing sequence annotations.	required
`sequence_interval`	`Series`	Series containing sequence intervals.	required
`mode`	`str`	Analysis mode passed through to `kmers_analysis`.	`'snps'`
`create_report`	`bool`	Whether to generate a report.	`False`
`chunk_size`	`int`	The chunk size for loading sequences.	`100`
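
To make the `word` and `step` parameters concrete, here is a minimal sketch of a sliding window over a sequence. It is not the library's implementation: `sliding_kmers` and `exclusive_kmers` are hypothetical helpers, and treating "exclusive" as "present in the sample but absent from the reference" is an assumption made for illustration.

```python
def sliding_kmers(sequence: str, word: int, step: int) -> list[tuple[int, str]]:
    """Return (start, k-mer) pairs for a window of length `word` moved by `step`."""
    if word <= 0 or step <= 0:
        raise ValueError("word and step must be positive")
    # The last valid start is len(sequence) - word; shorter sequences yield nothing.
    return [
        (start, sequence[start:start + word])
        for start in range(0, len(sequence) - word + 1, step)
    ]


def exclusive_kmers(sample: str, reference: str, word: int, step: int) -> list[str]:
    """K-mers found in `sample` but not in `reference` (assumed meaning of 'exclusive')."""
    reference_set = {kmer for _, kmer in sliding_kmers(reference, word, step)}
    seen: set[str] = set()
    result: list[str] = []
    for _, kmer in sliding_kmers(sample, word, step):
        # Keep first-occurrence order and drop duplicates.
        if kmer not in reference_set and kmer not in seen:
            seen.add(kmer)
            result.append(kmer)
    return result


if __name__ == "__main__":
    print(sliding_kmers("ACGTAC", word=3, step=2))
    print(exclusive_kmers("ACGTAC", "ACGAAC", word=3, step=1))
```

A larger `step` skips start positions, so fewer k-mers are produced for the same sequence.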

Returns:

Type	Description
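`tuple[defaultdict[str, list[str]], ndarray] \| tuple[None, None] \| tuple[defaultdict[str, list[str]], None]`	A tuple `(diffs_positions, report)`. `diffs_positions` holds the results of the k-mers analysis. `report` is an array when `create_report` is True and `None` otherwise. If `seq_kmers_exclusive` is empty, an error is logged and `(None, None)` is returned.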
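
Because the function can return three different tuple shapes, callers need to check for `None` before using either element. The sketch below shows one way to branch on the result. `describe_result` is a hypothetical helper, and the sample keys and values are placeholders rather than real output.

```python
from collections import defaultdict
from typing import Any, Optional


def describe_result(diffs_positions: Optional[dict], report: Optional[Any]) -> str:
    """Summarize the (diffs_positions, report) tuple returned by the analysis."""
    if diffs_positions is None:
        # (None, None): no exclusive k-mers were supplied.
        return "no exclusive k-mers"
    if report is None:
        # (diffs_positions, None): create_report was False.
        return f"{len(diffs_positions)} entries, no report"
    # (diffs_positions, report): create_report was True.
    return f"{len(diffs_positions)} entries, report of length {len(report)}"


if __name__ == "__main__":
    diffs = defaultdict(list)
    diffs["key_1"].append("placeholder")  # placeholder content, not real output
    print(describe_result(None, None))
    print(describe_result(diffs, None))
    print(describe_result(diffs, ["row_1", "row_2"]))
```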

Source code in python/gramep/analysis.py

def mutations_analysis(
    seq_path: str,
    ref_path: str,
    seq_kmers_exclusive: list[str],
    kmers_positions: defaultdict[str, list[int]],
    word: int,
    step: int,
    snps_max: int,
    annotation_dataframe: pd.DataFrame,
    sequence_interval: pd.Series,
    mode: str = 'snps',
    create_report: bool = False,
    chunk_size: int = 100,
) -> tuple[defaultdict[str, list[str]], np.ndarray] | tuple[
    None, None
] | tuple[defaultdict[str, list[str]], None]:
    """
    Perform k-mers analysis and optionally generate a report.

    This function performs k-mers analysis on the provided sequence data, exclusive \
        k-mers, and annotations. It calculates exclusive adjacencies, checks \
        differences, and returns results in a tuple. If 'create_report' is \
        set to True, a report is generated.

    Args:
        seq_path (str): The path to the file containing the sequences in FASTA format.
        ref_path (str): The path to the reference sequence data file.
        seq_kmers_exclusive (list[str]): A list of exclusive k-mers.
        kmers_positions (defaultdict[str, list[int]]): Positions associated with \
        the k-mers.
        word (int): The length of each k-mer.
        step (int): The step size for moving the sliding window.
        snps_max (int): The maximum number of SNPs allowed.
        annotation_dataframe (pd.DataFrame): DataFrame containing sequence annotations.
        sequence_interval (pd.Series): Series containing sequence intervals.
        mode (str, optional): Analysis mode passed to kmers_analysis. \
        Default is 'snps'.
        create_report (bool, optional): Whether to generate a report. Default is False.
        chunk_size (int, optional): The chunk size for loading sequences. \
        Default is 100.

    Returns:
        tuple[defaultdict[str, list[str]], np.ndarray]: A tuple containing results \
            of k-mers analysis and the report. The report is None when \
            'create_report' is False. Returns (None, None) when no exclusive \
            k-mers are provided.
    """

    progress = Progress(
        SpinnerColumn(),
        TaskProgressColumn(),
        TextColumn('[progress.description]{task.description}'),
        BarColumn(),
        TimeElapsedColumn(),
    )

    if len(seq_kmers_exclusive) == 0:
        message.error_no_exclusive_kmers()
        return None, None

    with progress:
        progress.add_task(
            '[cyan] Getting SNPs positions...',
            total=None,
        )
        diffs_positions = kmers_analysis(
            seq_path=seq_path,
            ref_path=ref_path,
            exclusive_kmers=seq_kmers_exclusive,
            final_positions=kmers_positions,
            k=word,
            step=step,
            max_dist=snps_max,
            mode=mode,
            batch_size=chunk_size,
        )

    if create_report:
        with joblib_progress(
            'Creating report ...', total=len(diffs_positions)
        ):
            report_list = Parallel(n_jobs=-2)(
                delayed(make_report)(
                    diffs_positions[key],
                    key,
                    sequence_interval,
                    annotation_dataframe,
                )
                for key in diffs_positions.keys()
            )
        report = np.hstack(report_list)

        return diffs_positions, report
    else:
        return diffs_positions, None
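
The report step above calls `Parallel(n_jobs=-2)`. In joblib, negative `n_jobs` values count back from the number of CPUs (`n_cpus + 1 + n_jobs`), so `-2` means all CPUs but one. The sketch below reproduces only that arithmetic. `workers_for` is a hypothetical helper, not a joblib function, and the floor of one worker is an assumption added for safety.

```python
import os
from typing import Optional


def workers_for(n_jobs: int, n_cpus: Optional[int] = None) -> int:
    """Approximate the worker count implied by a joblib-style n_jobs value."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on; never fewer than one.
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


if __name__ == "__main__":
    print(workers_for(-2, n_cpus=8))
```

Leaving one CPU free keeps the machine responsive while reports are built in parallel.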

`variants_analysis(save_path, intersection_seletion='ALL')`

Perform variants analysis based on intersection selection.

This function performs variants analysis based on the specified intersection selection criteria. It reads variant data from the provided file and returns a defaultdict containing analysis results.