Analysis

analysis

This module provides functions for analyzing exclusive k-mers obtained using the maximum entropy principle.

Contents
  • mutations_analysis: Perform k-mers analysis and optionally generate a report.
  • variants_analysis: Perform variants analysis based on intersection selection.
Todo
  • Implement tests.

message = Messages() module-attribute

Instance of the Messages class used for logging.

mutations_analysis(seq_path, ref_path, seq_kmers_exclusive, kmers_positions, word, step, snps_max, annotation_dataframe, sequence_interval, mode='snps', create_report=False, chunk_size=100)

Perform k-mers analysis and optionally generate a report.

This function performs k-mers analysis on the provided sequence data, exclusive k-mers, and annotations. It calculates exclusive adjacencies, checks differences, and returns results in a tuple. If 'create_report' is set to True, a report is generated.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| seq_path | str | The path to the file containing the sequences in FASTA format. | required |
| ref_path | str | The path to the reference sequence data file. | required |
| seq_kmers_exclusive | list[str] | A list of exclusive k-mers. | required |
| kmers_positions | defaultdict[str, list[int]] | Mapping from a string key (presumably the k-mer) to a list of integer positions; passed to kmers_analysis as final_positions. | required |
| word | int | The length of each k-mer. | required |
| step | int | The step size for moving the sliding window. | required |
| snps_max | int | The maximum number of SNPs allowed. | required |
| annotation_dataframe | DataFrame | DataFrame containing sequence annotations. | required |
| sequence_interval | Series | Series containing sequence intervals. | required |
| mode | str | Analysis mode passed to kmers_analysis. | 'snps' |
| create_report | bool | Whether to generate a report. | False |
| chunk_size | int | The chunk size for loading sequences. | 100 |
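To make the roles of word and step concrete, here is a minimal sketch of sliding-window k-mer extraction. The helper name sliding_kmers is hypothetical and is not part of gramep; the library's own windowing may differ in details such as how the final partial window is handled.

```python
def sliding_kmers(sequence: str, word: int, step: int) -> list[tuple[int, str]]:
    """Return (start, k-mer) pairs for windows of length `word` taken every `step` bases.

    Hypothetical helper for illustration only; not gramep's implementation.
    """
    if word <= 0 or step <= 0:
        raise ValueError('word and step must be positive')
    # Only full-length windows are kept; a trailing partial window is dropped.
    return [
        (start, sequence[start:start + word])
        for start in range(0, len(sequence) - word + 1, step)
    ]


kmers = sliding_kmers('ACGTACGT', word=3, step=2)
print(kmers)
```

With step=1 every position starts a window; larger steps skip positions and produce fewer k-mers.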

Returns:

Type Description
tuple[defaultdict[str, list[str]], ndarray] | tuple[None, None] | tuple[defaultdict[str, list[str]], None]

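A tuple (diffs_positions, report). The first element holds the results of the k-mers analysis; the second is the report array when create_report is True and None otherwise. If seq_kmers_exclusive is empty, an error is logged and (None, None) is returned.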

Source code in python/gramep/analysis.py
def mutations_analysis(
    seq_path: str,
    ref_path: str,
    seq_kmers_exclusive: list[str],
    kmers_positions: defaultdict[str, list[int]],
    word: int,
    step: int,
    snps_max: int,
    annotation_dataframe: pd.DataFrame,
    sequence_interval: pd.Series,
    mode: str = 'snps',
    create_report: bool = False,
    chunk_size: int = 100,
) -> tuple[defaultdict[str, list[str]], np.ndarray] | tuple[
    None, None
] | tuple[defaultdict[str, list[str]], None]:
    """
    Perform k-mers analysis and optionally generate a report.

    This function performs k-mers analysis on the provided sequence data, exclusive \
        k-mers, and annotations. It calculates exclusive adjacencies, checks \
        differences, and returns results in a tuple. If 'create_report' is \
        set to True, a report is generated.

    Args:
        seq_path (str): The path to the file containing the sequences in FASTA format.
        ref_path (str): The path to the reference sequence data file.
        seq_kmers_exclusive (list[str]): A list of exclusive k-mers.
        kmers_positions (defaultdict[str, list[int]]): Positions passed to \
        kmers_analysis as final_positions.
        word (int): The length of each k-mer.
        step (int): The step size for moving the sliding window.
        snps_max (int): The maximum number of SNPs allowed.
        annotation_dataframe (pd.DataFrame): DataFrame containing sequence annotations.
        sequence_interval (pd.Series): Series containing sequence intervals.
        mode (str, optional): Analysis mode passed to kmers_analysis. \
        Default is 'snps'.
        create_report (bool, optional): Whether to generate a report. Default is False.
        chunk_size (int, optional): The chunk size for loading sequences. \
        Default is 100.

    Returns:
        tuple: (diffs_positions, report) when create_report is True, \
            (diffs_positions, None) otherwise, or (None, None) when \
            seq_kmers_exclusive is empty.
    """

    progress = Progress(
        SpinnerColumn(),
        TaskProgressColumn(),
        TextColumn('[progress.description]{task.description}'),
        BarColumn(),
        TimeElapsedColumn(),
    )

    if len(seq_kmers_exclusive) == 0:
        message.error_no_exclusive_kmers()
        return None, None

    with progress:
        progress.add_task(
            '[cyan] Getting SNPs positions...',
            total=None,
        )
        diffs_positions = kmers_analysis(
            seq_path=seq_path,
            ref_path=ref_path,
            exclusive_kmers=seq_kmers_exclusive,
            final_positions=kmers_positions,
            k=word,
            step=step,
            max_dist=snps_max,
            mode=mode,
            batch_size=chunk_size,
        )

    if create_report:
        with joblib_progress(
            'Creating report ...', total=len(diffs_positions)
        ):
            report_list = Parallel(n_jobs=-2)(
                delayed(make_report)(
                    diffs_positions[key],
                    key,
                    sequence_interval,
                    annotation_dataframe,
                )
                for key in diffs_positions.keys()
            )
        report = np.hstack(report_list)

        return diffs_positions, report
    else:
        return diffs_positions, None
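The chunk_size argument is forwarded to kmers_analysis as batch_size, whose internals are not shown on this page. As a rough illustration of what loading sequences in chunks can look like, the sketch below parses FASTA records lazily and groups them into batches. Both helpers (read_fasta and batched_records) are hypothetical and written with the standard library only; they are not gramep's loader.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, ''.join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, ''.join(parts)


def batched_records(records: Iterable, chunk_size: int) -> Iterator[list]:
    """Group an iterable into lists of at most chunk_size items."""
    iterator = iter(records)
    while batch := list(islice(iterator, chunk_size)):
        yield batch


# In practice `lines` would be an open file handle; a list keeps the example self-contained.
fasta_lines = ['>s1', 'ACGT', 'AC', '>s2', 'GG', '>s3', 'TT']
batches = list(batched_records(read_fasta(fasta_lines), chunk_size=2))
print(batches)
```

Processing one batch at a time keeps memory use bounded by chunk_size records rather than by the size of the whole file.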

variants_analysis(save_path, intersection_seletion='ALL')

Perform variants analysis based on intersection selection.

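This function loads the exclusive k-mers of each variant from save_path and intersects them according to intersection_seletion. It appends the result to intersections.txt, saves a plot of the intersections as intersections.png (both under save_path), and returns the intersections as a defaultdict.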

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| save_path | str | The path to the directory containing the variant data. The files intersections.txt and intersections.png are written to the same directory. | required |
| intersection_seletion | str | Criteria for selecting which variants to intersect. To specify the variants for intersection, provide their names separated by '-', for example 'variant1-variant2-variant3'. | 'ALL' |
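The '-'-separated selection format can be illustrated with a small sketch. It assumes each variant's exclusive k-mers are available as a set keyed by variant name and computes a single intersection. The function select_and_intersect is hypothetical; the real variants_intersection returns richer results (a mapping of intersections plus the underlying sets), so treat this as a simplified picture of the selection step only.

```python
def select_and_intersect(
    variant_kmers: dict[str, set[str]], selection: str = 'ALL'
) -> set[str]:
    """Intersect the k-mer sets of the selected variants.

    Hypothetical helper; 'ALL' selects every variant, otherwise names
    are split on '-'.
    """
    names = list(variant_kmers) if selection == 'ALL' else selection.split('-')
    missing = [name for name in names if name not in variant_kmers]
    if missing:
        raise KeyError(f'unknown variants: {missing}')
    if not names:
        return set()
    return set.intersection(*(variant_kmers[name] for name in names))


# Made-up variant names and k-mers.
variant_kmers = {
    'alpha': {'AAA', 'ACG', 'TTT'},
    'beta': {'ACG', 'TTT'},
    'gamma': {'ACG', 'GGG'},
}
print(select_and_intersect(variant_kmers, 'alpha-beta'))
print(select_and_intersect(variant_kmers, 'ALL'))
```

One consequence of this format is that a variant whose name itself contains '-' cannot be selected unambiguously, at least under the splitting rule assumed here.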

Returns:

Type Description
defaultdict[str, list[str]]

A defaultdict mapping each intersection to the list of k-mers it contains. The same mapping is written to intersections.txt.

Source code in python/gramep/analysis.py
def variants_analysis(
    save_path: str, intersection_seletion: str = 'ALL'
) -> defaultdict[str, list[str]]:
    """
    Perform variants analysis based on intersection selection.

    This function performs variants analysis based on the specified intersection \
        selection criteria.
    It reads variant data from the provided file and returns a defaultdict \
        containing analysis results.

    Args:
        save_path (str): The path to the directory containing the variant data.
        intersection_seletion (str, optional): Criteria for selecting which variants \
        to intersect. To specify the variants for intersection, provide them \
        separated by '-'. For example: 'variant1-variant2-variant3'. Default is 'ALL'.

    Returns:
        defaultdict[str, list[str]]: A defaultdict containing analysis results.

    """
    variants_exclusive_kmers, variants_names = load_variants_exclusive(
        save_path
    )

    intersection_kmers, intersection_kmers_sets = variants_intersection(
        variants_exclusive_kmers, variants_names, intersection_seletion
    )

    with open(save_path + '/intersections.txt', 'a') as export_file:
        export_file.write('INTERSECTIONS\n')
        for k, v in intersection_kmers.items():
            export_file.write(str(k) + ': ' + str(', '.join(v)) + '\n')

    plot(from_contents(intersection_kmers_sets))
    plt.savefig(save_path + '/intersections.png')

    message.info_intersections_saved(save_path + '/intersections.txt')
    return intersection_kmers
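The export loop above writes an INTERSECTIONS header followed by one "key: kmer1, kmer2" line per intersection, and the file is opened in append mode, so repeated runs add repeated blocks. A minimal sketch of reading such a file back is shown below. The parse_intersections helper and the sample keys are hypothetical, and the sketch assumes keys do not themselves contain ': '.

```python
def parse_intersections(lines) -> dict[str, list[str]]:
    """Parse lines in the 'key: kmer1, kmer2' layout written to intersections.txt.

    Hypothetical helper. Header lines are skipped; when a key appears in
    several appended blocks, the last occurrence wins.
    """
    result: dict[str, list[str]] = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line or line == 'INTERSECTIONS':
            continue
        key, sep, value = line.partition(': ')
        if not sep:
            # Not a 'key: value' line; ignore it.
            continue
        result[key] = value.split(', ') if value else []
    return result


# Simulated file content after two runs appended to the same file.
sample_text = (
    'INTERSECTIONS\n'
    'alpha-beta: ACG, TTT\n'
    'INTERSECTIONS\n'
    'alpha-beta: ACG, TTT\n'
    'gamma: \n'
)
parsed = parse_intersections(sample_text.splitlines(keepends=True))
print(parsed)
```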