Data io

data_io module

This module provides functions for loading and saving data.

Contents
  • annotation_dataframe: Process annotation information from a GFF3 file into a DataFrame and a sequence interval Series.
  • load_sequences: Load and select the most informative k-mers from sequences.
  • save_exclusive_kmers: Save exclusive k-mers to a file.
  • save_intersection_kmers: Save intersection k-mers to a file.
  • save_diffs_positions: Save variations and their positions to a file.
  • write_report: Write a report to a file.
  • write_frequencies: Write k-mer frequencies to a file.
  • load_variants_exclusive: Load variants and exclusive k-mers from a folder.
  • load_variants_kmers: Load and return unique exclusive k-mers from saved files.
  • save_data: Save data to files in the specified directory.
  • save_ranges: Save MinMaxScaler object ranges to a file.
  • load_ranges: Load MinMaxScaler object ranges from a file.
  • load_model: Load a RandomForestClassifier model from a file.
  • save_model: Save a RandomForestClassifier model to a file.
  • save_metrics: Save accuracy and metrics data to a file.
  • save_confusion_matrix: Save a confusion matrix plot to a file.
  • save_predict_data: Save predicted data to a CSV file.
  • load_exclusive_kmers_file: Load exclusive k-mers from a file.
  • load_mutations: Load and return the unique mutations.
  • load_reports: Load and return the reports dirnames.
Todo
  • Implement tests.

message = Messages() module-attribute

Set the Message class for logging.

annotation_dataframe(annotation_path)

Process annotation information from a GFF3 file into a DataFrame and a sequence interval Series.

Parameters:

Name Type Description Default
annotation_path str

The path to the GFF3 annotation file.

required

Returns:

Name Type Description
tuple tuple[DataFrame, Series]

A tuple containing two elements:
  • df_annotation (pd.DataFrame): A DataFrame with processed annotation data.
  • seq_interval (pd.Series): A Series containing sequence intervals generated from the data.

Source code in python/gramep/data_io.py
def annotation_dataframe(
    annotation_path: str,
) -> tuple[pd.DataFrame, pd.Series]:
    """
    Process annotation information from a GFF3 file into a DataFrame and a sequence \
    interval Series.

    Args:
        annotation_path (str): The path to the GFF3 annotation file.

    Returns:
        tuple: A tuple containing two elements:
            - df_annotation (pd.DataFrame): A DataFrame with processed annotation data.
            - seq_interval (pd.Series): A Series containing sequence intervals \
            generated from the data.
    """
    df_annotation = gffpd.read_gff3(annotation_path).attributes_to_columns()
    df_annotation = df_annotation[['type', 'start', 'end', 'Name', 'product']][
        df_annotation['Name'].notna()
    ]
    seq_interval = df_annotation.apply(get_sequence_interval, axis=1)
    return df_annotation, seq_interval
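
As a rough illustration of the data this function works with, the sketch below parses GFF3 text with the standard library only. It splits the ninth column into attributes and keeps the features that carry a `Name` attribute, which mirrors the `notna()` filter above. It does not use gffpandas; `parse_gff3_features` and the sample records are hypothetical, and a missing `product` is stored as an empty string here rather than as a missing value.

```python
from urllib.parse import unquote


def parse_gff3_features(text: str) -> list[dict[str, str]]:
    """Hypothetical helper: keep GFF3 features that have a Name attribute."""
    features = []
    for line in text.splitlines():
        # Skip blank lines and directives/comments such as '##gff-version 3'.
        if not line.strip() or line.startswith('#'):
            continue
        columns = line.split('\t')
        if len(columns) != 9:
            continue
        # Column 9 holds 'key=value' pairs separated by ';'.
        attributes = {}
        for pair in columns[8].split(';'):
            if '=' in pair:
                key, value = pair.split('=', 1)
                attributes[key] = unquote(value)
        if 'Name' not in attributes:
            continue
        features.append(
            {
                'type': columns[2],
                'start': columns[3],
                'end': columns[4],
                'Name': attributes['Name'],
                'product': attributes.get('product', ''),
            }
        )
    return features


# Made-up records: two named features and one unnamed region.
sample = (
    '##gff-version 3\n'
    'ref1\tsrc\tgene\t10\t90\t.\t+\t.\tID=gene1;Name=geneA\n'
    'ref1\tsrc\tCDS\t10\t90\t.\t+\t0\tID=cds1;Name=geneA;product=protein A\n'
    'ref1\tsrc\tregion\t1\t100\t.\t+\t.\tID=region1\n'
)
features = parse_gff3_features(sample)
```

The unnamed `region` record is dropped, which is the same effect the `Name` filter has on the DataFrame.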

load_model(load_model_path)

Load a RandomForestClassifier model from a file.

This function loads a trained RandomForestClassifier model from a file and returns the model.

Parameters:

Name Type Description Default
load_model_path str

The path to the file containing the trained model.

required

Returns:

Name Type Description
RandomForestClassifier RandomForestClassifier

A trained RandomForestClassifier model.

Source code in python/gramep/data_io.py
def load_model(load_model_path: str) -> RandomForestClassifier:
    """
    Load a RandomForestClassifier model from a file.

    This function loads a trained RandomForestClassifier model from a file and \
    returns the model.

    Args:
        load_model_path (str): The path to the file containing the trained model.

    Returns:
        RandomForestClassifier: A trained RandomForestClassifier model.
    """
    with open(load_model_path, 'rb') as handle:
        return pickle.load(handle)
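
Because the model is read with `pickle.load`, the file should come from a trusted source: unpickling can execute arbitrary code. The sketch below shows the save/load round trip with the standard library only. It uses a plain dictionary as a stand-in for a trained model, and the variable names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model; any picklable object behaves the same way.
fake_model = {'n_estimators': 100, 'classes': ['variantA', 'variantB']}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'model.sav'

    # Write the object in binary mode, as save_model does.
    with open(model_path, 'wb') as handle:
        pickle.dump(fake_model, handle)

    # Read it back the same way load_model does.
    with open(model_path, 'rb') as handle:
        restored = pickle.load(handle)
```

The restored object is an equal copy, not the same object, so a loaded model is independent of the one that was saved.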

load_ranges(load_ranges_path)

Load MinMaxScaler object ranges from a file.

This function loads a MinMaxScaler object's data range information from a file and returns the MinMaxScaler object.

Parameters:

Name Type Description Default
load_ranges_path str

The path to the file containing MinMaxScaler object ranges.

required

Returns:

Name Type Description
MinMaxScaler MinMaxScaler

A MinMaxScaler object with loaded data range information.

Source code in python/gramep/data_io.py
def load_ranges(load_ranges_path: str) -> MinMaxScaler:
    """
    Load MinMaxScaler object ranges from a file.

    This function loads a MinMaxScaler object's data range information from a \
    file and returns the MinMaxScaler object.

    Args:
        load_ranges_path (str): The path to the file containing MinMaxScaler \
        object ranges.

    Returns:
        MinMaxScaler: A MinMaxScaler object with loaded data range information.
    """

    with open(load_ranges_path, 'rb') as handle:
        return pickle.load(handle)

load_sequences(file_path, word, step, dictonary='DNA', reference=False, chunk_size=100)

Load sequences from a file and select the most informative k-mers.

This function reads sequences from a file and selects the most informative k-mers of the specified length and step size. Optionally, the function can use a custom dictionary of characters and return a list of k-mers or the reference sequence as a single string.

Parameters:

Name Type Description Default
file_path str

The path to the file containing sequences.

required
word int

The length of each k-mer.

required
step int

The step size for moving the sliding window.

required
dictonary str

A string containing characters to consider in sequences. Default is 'DNA' (DNA alphabet).

'DNA'
reference bool

If True, return the reference sequence. Default is False.

False
chunk_size int

The chunk size for loading sequences. Default is 100.

100

Returns:

Type Description
dict[str, int]

dict[str, int]: A defaultdict mapping most informative k-mers.

Source code in python/gramep/data_io.py
def load_sequences(
    file_path: str,
    word: int,
    step: int,
    dictonary: str = 'DNA',
    reference: bool = False,
    chunk_size: int = 100,
) -> dict[str, int]:
    """
    Load sequences from a file and select the most informative k-mers.

    This function reads sequences from a file and selects the most informative
    k-mers of the specified length and step size. Optionally, the function can
    use a custom dictionary of characters and return a list of k-mers or the
    reference sequence as a single string.

    Args:
        file_path (str): The path to the file containing sequences.
        word (int): The length of each k-mer.
        step (int): The step size for moving the sliding window.
        dictonary (str, optional): A string containing characters to consider \
        in sequences. Default is 'DNA' (DNA alphabet).
        reference (bool, optional): If True, return the reference sequence.\
        Default is False.
        chunk_size (int, optional): The chunk size for loading sequences. \
        Default is 100.

    Returns:
        dict[str, int]: A defaultdict mapping most informative k-mers.
    """
    message.info_start()

    progress = Progress(
        SpinnerColumn(),
        TaskProgressColumn(),
        TextColumn('[progress.description]{task.description}'),
        BarColumn(),
        TimeElapsedColumn(),
    )

    if reference:
        loading_text = 'Loading reference sequence ...'
    else:
        loading_text = 'Loading sequences ...'

    with progress:
        progress.add_task(f'[cyan]{loading_text}', total=None)
        kmers = get_kmers(
            file_path, word, step, dictonary, reference, chunk_size
        )

    return kmers
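
The selection of the most informative k-mers happens inside `get_kmers`, whose source is not shown on this page. The sketch below covers only the sliding-window part, under the assumption that `word` is the k-mer length and `step` is the distance between consecutive window starts. `sliding_kmers` and `count_kmers` are hypothetical helpers, and counting occurrences is only one plausible reading of the `dict[str, int]` return type.

```python
from collections import defaultdict


def sliding_kmers(sequence: str, word: int, step: int) -> list[str]:
    """Hypothetical helper: k-mers of length `word`, advancing `step` each time."""
    if word <= 0 or step <= 0:
        raise ValueError('word and step must be positive')
    return [
        sequence[start : start + word]
        for start in range(0, len(sequence) - word + 1, step)
    ]


def count_kmers(sequence: str, word: int, step: int) -> defaultdict[str, int]:
    """Count how often each k-mer occurs in a single sequence."""
    counts: defaultdict[str, int] = defaultdict(int)
    for kmer in sliding_kmers(sequence, word, step):
        counts[kmer] += 1
    return counts


kmers = sliding_kmers('ACGTACGT', word=3, step=1)
counts = count_kmers('ACGTACGT', word=3, step=1)
```

With `step=1` the windows overlap by `word - 1` characters; a larger `step` skips positions and yields fewer k-mers.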

load_variants_exclusive(save_path)

Load variants and exclusive variants from a folder.

This function scans the subfolders of the specified directory and loads the pickled variations file (<name>_variations.sav) found in each one. It returns a tuple containing a defaultdict that maps each variant name to its loaded variations and a list of variant names.

Parameters:

Name Type Description Default
save_path str

The path to the directory containing one subfolder per variant.

required

Returns:

Type Description
tuple[defaultdict[str, list[str]], list[str]]

tuple[defaultdict[str, list[str]], list[str]]: A tuple containing a defaultdict that maps each variant name to its loaded variations and a list of variant names.

Source code in python/gramep/data_io.py
def load_variants_exclusive(
    save_path: str,
) -> tuple[defaultdict[str, list[str]], list[str]]:
    """
    Load variants and exclusive variants from a folder.

    This function scans the subfolders of the specified directory and loads the
    pickled variations file found in each one. It returns a tuple containing a \
    defaultdict of variant information and a list of variant names.

    Args:
        save_path (str): The path to the directory containing one subfolder \
        per variant.

    Returns:
        tuple[defaultdict[str, list[str]], list[str]]: A tuple containing a \
        defaultdict of variant information and a list of variant names.
    """
    subfolders = [f.path for f in scandir(save_path) if f.is_dir()]
    loads = defaultdict(list)
    variants_names = []
    for dirname in subfolders:
        with open(
            dirname + '/' + dirname.split(sep='/')[-1] + '_variations.sav',
            'rb',
        ) as f:
            loads[dirname.split(sep='/')[-1]] = pickle.load(f)
        variants_names.append(dirname.split(sep='/')[-1])
    return loads, variants_names
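
The loop above expects one subfolder per variant, each holding a `<name>_variations.sav` pickle. The sketch below builds that layout in a temporary directory and reads it back with `pathlib`. `load_variations` is a hypothetical equivalent written for illustration, and the variant names and variations are made up.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path


def load_variations(save_path: str) -> tuple[defaultdict[str, list[str]], list[str]]:
    """Hypothetical pathlib-based equivalent of the loop above."""
    loads: defaultdict[str, list[str]] = defaultdict(list)
    names = []
    # Sorting makes the order deterministic; the original follows scandir order.
    for folder in sorted(p for p in Path(save_path).iterdir() if p.is_dir()):
        with open(folder / f'{folder.name}_variations.sav', 'rb') as handle:
            loads[folder.name] = pickle.load(handle)
        names.append(folder.name)
    return loads, names


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build the expected layout with made-up variant names and variations.
    samples = {'variantA': ['100:AT'], 'variantB': ['250:GC', '300:TA']}
    for name, variations in samples.items():
        folder = Path(tmp_dir) / name
        folder.mkdir()
        with open(folder / f'{name}_variations.sav', 'wb') as handle:
            pickle.dump(variations, handle)

    loaded, variant_names = load_variations(tmp_dir)
```

Using `folder.name` avoids splitting the path on `'/'`, which the original relies on and which may not behave the same on platforms with a different path separator.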

load_variants_kmers(save_path)

Load and return unique exclusive k-mers from saved files.

This function reads exclusive k-mers data from saved files located within subfolders of the specified 'save_path'. It returns a list containing the unique set of exclusive k-mers across all files.

Parameters:

Name Type Description Default
save_path str

The path to the directory containing subfolders with saved exclusive k-mers files.

required

Returns:

Name Type Description
list list[str]

A list containing the unique set of exclusive k-mers.

Source code in python/gramep/data_io.py
def load_variants_kmers(save_path: str) -> list[str]:
    """
    Load and return unique exclusive k-mers from saved files.

    This function reads exclusive k-mers data from saved files located within \
    subfolders of the specified
    'save_path'. It returns a list containing the unique set of exclusive \
    k-mers across all files.

    Args:
        save_path (str): The path to the directory containing subfolders with \
        saved exclusive k-mers files.

    Returns:
        list: A list containing the unique set of exclusive k-mers.
    """
    subfolders = [f.path for f in scandir(save_path) if f.is_dir()]
    loads = []
    for dirname in subfolders:
        with open(
            dirname + '/' + dirname.split(sep='/')[-1] + '_ExclusiveKmers.sav',
            'rb',
        ) as f:
            seq_kmers_exclusive = [''.join(item) for item in pickle.load(f)]
            loads.extend(seq_kmers_exclusive)

    return np.unique(loads).tolist()
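
The two steps that shape the result are the `''.join(item)` call and `np.unique`, which returns sorted, duplicate-free values. The sketch below reproduces both with the standard library. The file contents are made up, and storing k-mers as tuples of characters is an assumption suggested by the join, not something the source confirms.

```python
# Made-up contents of two '_ExclusiveKmers.sav' files; items may be strings
# or tuples of characters, and ''.join handles both.
file_a = [('A', 'C', 'G'), 'TTA']
file_b = ['ACG', ('G', 'G', 'C')]

loads = []
for items in (file_a, file_b):
    loads.extend(''.join(item) for item in items)

# For plain ASCII k-mers, sorted(set(...)) gives the same sorted,
# duplicate-free list that np.unique(loads).tolist() would produce.
unique_kmers = sorted(set(loads))
```

The k-mer `ACG` appears in both files but only once in the result, so the returned list carries no information about which variant a k-mer came from.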

save_confusion_matrix(conf_mtx, name_class, vmax, dir_path)

Save a confusion matrix plot to a file.

This function saves a confusion matrix plot, based on the provided confusion matrix and class names, to a file in the specified directory. The plot includes the color scale, with the maximum value set by the 'vmax' parameter.

Parameters:

Name Type Description Default
conf_mtx ndarray

The confusion matrix data to be plotted.

required
name_class ndarray

The array of class names corresponding to the matrix.

required
vmax int

The maximum value for the color scale in the plot.

required
dir_path str

The path to the directory where the plot file will be saved.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def save_confusion_matrix(
    conf_mtx: np.ndarray,
    name_class: np.ndarray,
    vmax: int,
    dir_path: str,
):
    """
    Save a confusion matrix plot to a file.

    This function saves a confusion matrix plot, based on the provided \
    confusion matrix and class names, to a file in the specified directory. \
    The plot includes the color scale, with the maximum value set \
    by the 'vmax' parameter.

    Args:
        conf_mtx (np.ndarray): The confusion matrix data to be plotted.
        name_class (np.ndarray): The array of class names corresponding to the matrix.
        vmax (int): The maximum value for the color scale in the plot.
        dir_path (str): The path to the directory where the plot file will be saved.

    Returns:
        Message class
    """

    path_dir = str(Path(dir_path).parent) + '/classify/results'
    path = Path(path_dir)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)
    text = 'Heatmap for Random Forest\nClassification Algorithm'
    save_name = '/confusion_matrix.pdf'
    fmt = 'd'
    figsize = (20, 15)

    cm_fig = plt.figure(figsize=figsize)
    axes = plt.axes()
    # x_axis_labels = name_class
    # y_axis_labels = name_class
    heatmap(
        conf_mtx,
        vmin=0,
        vmax=vmax,
        annot=True,
        fmt=fmt,
        ax=axes,
        xticklabels=name_class,
        yticklabels=name_class,
    )
    axes.set_title(text, fontsize=20, pad=15)
    cm_fig.savefig(path_dir + save_name, dpi=300)

    return message.result_confusion_matrix(path_dir)
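
Note that the plot is not written inside `dir_path` itself: the output directory is built from the parent of `dir_path`. The same expression appears in save_data and save_metrics. The sketch below shows how the path resolves on a POSIX system; `results_dir` is a hypothetical helper and the example paths are made up.

```python
from pathlib import Path


def results_dir(dir_path: str) -> str:
    """Hypothetical helper reproducing the path expression used above."""
    return str(Path(dir_path).parent) + '/classify/results'


# A trailing slash does not change the result, because Path normalizes it away.
nested = results_dir('output/run1/data')
trailing = results_dir('output/run1/data/')
# A bare name has '.' as its parent, so results land in the working directory.
bare = results_dir('data')
```

Here `nested` and `trailing` both resolve to `output/run1/classify/results`, a sibling of the `data` directory rather than a child of it.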

save_data(data_frame, class_names_to_save, dir_path)

Save data to files in the specified directory.

This function adds the provided class names to the data frame as a 'CLASS' column and saves the data frame in CSV format (dataframe.csv). The file is written to the classify/results directory next to the specified directory.

Parameters:

Name Type Description Default
data_frame DataFrame

The pandas DataFrame to be saved.

required
class_names_to_save ndarray

The array of class names to be saved.

required
dir_path str

The path to the directory where files will be saved.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def save_data(
    data_frame: pd.DataFrame, class_names_to_save: np.ndarray, dir_path: str
):
    """
    Save data to files in the specified directory.

    This function adds the provided class names to the data frame as a
    'CLASS' column and saves the data frame in CSV format (dataframe.csv).

    Args:
        data_frame (pd.DataFrame): The pandas DataFrame to be saved.
        class_names_to_save (np.ndarray): The array of class names to be saved.
        dir_path (str): The path to the directory where files will be saved.

    Returns:
        Message class
    """

    path_dir = str(Path(dir_path).parent) + '/classify/results'
    path = Path(path_dir)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)

    data_frame['CLASS'] = class_names_to_save
    data_frame.to_csv(path_dir + '/dataframe.csv', encoding='utf-8')
    return message.info_dataframe_saved(path_dir)

save_diffs_positions(sequence_path, ref_path, variations, save_path)

Save variations and their positions to a file.

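This function takes a sequence data file path, a reference path, a list of variations, and a save path. It writes the variations and their positions in several formats (pickle, text, and BED3), together with a reference FASTA file, to a subdirectory of the save path named after the sequence file.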

Parameters:

Name Type Description Default
sequence_path str

The path to the sequence data file.

required
ref_path str

The path to the reference sequence file.

required
variations list[str]

A list of variations.

required
save_path str

The path to save the variations and their positions.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def save_diffs_positions(
    sequence_path: str,
    ref_path: str,
    variations: list[str],
    save_path: str,
):
    """
    Save variations and their positions to a file.

    This function takes a sequence data file path, a list of variations, and a \
    save path.
    It writes the variations and their positions to the specified file in \
        multiple formats.

    Args:
        sequence_path (str): The path to the sequence data file.
        ref_path (str): The path to the reference sequence file.
        variations (list[str]): A list of variations.
        save_path (str): The path to save the variations and their positions.

    Returns:
        Message class
    """
    seq_name = sequence_path.split('/')[-1].split('.')[0]

    # Check if path exists
    dirPath = save_path + '/' + seq_name
    path = Path(dirPath)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)
    save_file_name = dirPath + '/' + seq_name + '_variations.sav'
    with open(save_file_name, 'wb') as exclusive_kmers_file:
        pickle.dump(variations, exclusive_kmers_file)
    save_file_name = dirPath + '/' + seq_name + '_variations.txt'
    with open(save_file_name, 'w') as exclusive_kmers_file:
        exclusive_kmers_file.write(str(variations))
    save_file_name = dirPath + '/' + seq_name + '_variations.bed3'
    with open(save_file_name, 'w') as exclusive_kmers_file:
        for variation in variations:
            position, mutation = variation.split(':')
            exclusive_kmers_file.write(
                str(
                    str(seq_name)
                    + '\t'
                    + str(position)
                    + '\t'
                    + str(int(position) + 1)
                    + '\n'
                )
            )

    save_file_name = dirPath + '/' + seq_name + '_reference.fasta'

    write_ref(
        ref_path=ref_path, variations=variations, save_path=save_file_name
    )

    return message.info_kmers_saved(dirPath)
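
The BED3 output is the least obvious of the formats written above. Each variation is expected to be a `position:mutation` string, and each line covers a single base starting at that position. The sketch below builds the same lines in memory; `bed3_lines` is a hypothetical helper and the variations are made up.

```python
def bed3_lines(seq_name: str, variations: list[str]) -> list[str]:
    """Hypothetical helper: one BED3 line per 'position:mutation' entry."""
    lines = []
    for variation in variations:
        position, _mutation = variation.split(':')
        # Start and end differ by one, so each interval covers a single base.
        lines.append(f'{seq_name}\t{position}\t{int(position) + 1}\n')
    return lines


lines = bed3_lines('variantA', ['100:AT', '2345:GC'])
```

Only the position is used for the BED3 file; the mutation part after the colon is discarded there and survives only in the `.sav` and `.txt` outputs.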

save_exclusive_kmers(sequence_path, seq_kmers_exclusive, save_path)

Save exclusive k-mers to a file.

This function takes a sequence data file path, a list of exclusive k-mers, and a save path. It writes the exclusive k-mers in pickle (.sav) and text (.txt) formats to a subdirectory of the save path named after the sequence file.

Parameters:

Name Type Description Default
sequence_path str

The path to the sequence data file.

required
seq_kmers_exclusive list[str]

A list of exclusive k-mers.

required
save_path str

The path to save the exclusive k-mers.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def save_exclusive_kmers(
    sequence_path: str,
    seq_kmers_exclusive: list[str],
    save_path: str,
):
    """
    Save exclusive k-mers to a file.

    This function takes a sequence data file path, a list of exclusive k-mers,
    and a save path. It writes the exclusive k-mers to the specified file.

    Args:
        sequence_path (str): The path to the sequence data file.
        seq_kmers_exclusive (list[str]): A list of exclusive k-mers.
        save_path (str): The path to save the exclusive k-mers.

    Returns:
        Message class
    """
    seq_name = sequence_path.split('/')[-1].split('.')[0]
    # Check if path exists
    dirPath = save_path + '/' + seq_name
    path = Path(dirPath)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)
    save_file_name = dirPath + '/' + seq_name + '_ExclusiveKmers.sav'
    with open(save_file_name, 'wb') as exclusive_kmers_file:
        pickle.dump(seq_kmers_exclusive, exclusive_kmers_file)
    save_file_name = dirPath + '/' + seq_name + '_ExclusiveKmers.txt'
    with open(save_file_name, 'w') as exclusive_kmers_file:
        exclusive_kmers_file.write(str(seq_kmers_exclusive))
    return message.info_kmers_saved(dirPath)
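
The output location depends on how `seq_name` is derived from `sequence_path`, and the same expression is used by the other save and write functions in this module. The sketch below reproduces it to show one consequence: the name is cut at the first dot, not at the extension. `sequence_name` and `exclusive_kmers_paths` are hypothetical helpers and the file names are made up.

```python
def sequence_name(sequence_path: str) -> str:
    """Hypothetical helper reproducing the seq_name expression used above."""
    return sequence_path.split('/')[-1].split('.')[0]


def exclusive_kmers_paths(sequence_path: str, save_path: str) -> list[str]:
    """Paths of the two files written for a given input, as built above."""
    name = sequence_name(sequence_path)
    dir_path = save_path + '/' + name
    return [
        dir_path + '/' + name + '_ExclusiveKmers.sav',
        dir_path + '/' + name + '_ExclusiveKmers.txt',
    ]


simple = sequence_name('data/variantA.fasta')
# Everything after the first dot is dropped, not only the extension.
dotted = sequence_name('data/B.1.1.7.fasta')
paths = exclusive_kmers_paths('data/variantA.fasta', 'results')
```

Input files whose names contain extra dots can therefore collapse to the same short name and overwrite each other's output, so renaming such files before processing may be worthwhile.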

save_intersection_kmers(sequence_path, seq_kmers_intersections, save_path)

Save intersection k-mers to a file.

This function takes a sequence data file path, a list of intersection k-mers, and a save path. It writes the intersection k-mers in pickle (.sav) and text (.txt) formats to a subdirectory of the save path named after the sequence file.

Parameters:

Name Type Description Default
sequence_path str

The path to the sequence data file.

required
seq_kmers_intersections list[str]

A list of intersection k-mers.

required
save_path str

The path to save the intersection k-mers.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def save_intersection_kmers(
    sequence_path: str,
    seq_kmers_intersections: list[str],
    save_path: str,
):
    """
    Save intersection k-mers to a file.

    This function takes a sequence data file path, a list of intersection k-mers,
    and a save path. It writes the intersection k-mers to the specified file.

    Args:
        sequence_path (str): The path to the sequence data file.
        seq_kmers_intersections (list[str]): A list of intersection k-mers.
        save_path (str): The path to save the intersection k-mers.

    Returns:
        Message class
    """
    seq_name = sequence_path.split('/')[-1].split('.')[0]
    # Check if path exists
    dirPath = save_path + '/' + seq_name
    path = Path(dirPath)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)
    save_file_name = dirPath + '/' + seq_name + '_IntersectionKmers.sav'
    with open(save_file_name, 'wb') as exclusive_kmers_file:
        pickle.dump(seq_kmers_intersections, exclusive_kmers_file)
    save_file_name = dirPath + '/' + seq_name + '_IntersectionKmers.txt'
    with open(save_file_name, 'w') as exclusive_kmers_file:
        exclusive_kmers_file.write(str(seq_kmers_intersections))

    return message.info_kmers_saved(dirPath)

save_metrics(acc, metrics, dir_path)

Save accuracy and metrics data to a file.

This function saves the provided accuracy and metrics data to a file in the specified directory. The saved data can be used for analysis and reporting.

Parameters:

Name Type Description Default
acc str

The accuracy data to be saved.

required
metrics str

The metrics data to be saved.

required
dir_path str

The path to the directory where the data file will be saved.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def save_metrics(acc: str, metrics: str, dir_path: str):
    """
    Save accuracy and metrics data to a file.

    This function saves the provided accuracy and metrics data to a file in \
    the specified directory.
    The saved data can be used for analysis and reporting.

    Args:
        acc (str): The accuracy data to be saved.
        metrics (str): The metrics data to be saved.
        dir_path (str): The path to the directory where the data file will be saved.

    Returns:
        Message class
    """

    path_dir = str(Path(dir_path).parent) + '/classify/results'
    path = Path(path_dir)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)

    file_name = path_dir + '/metrics.txt'
    with open(file_name, 'a') as metrics_file:
        metrics_file.write(
            'GRAMEP - Genome vaRiation Analysis from the Maximum Entropy Principle.\n'
        )
        metrics_file.write('Results from validation\n')
        metrics_file.write('Accuracy:')
        metrics_file.write(acc)
        metrics_file.write('\nMetrics:\n')
        metrics_file.write(metrics)
    return message.info_metrics_saved(path_dir)

save_model(model, dir_path)

Save a RandomForestClassifier model to a file.

This function saves the provided RandomForestClassifier model to a file in the specified directory. The saved model can be loaded and used for predictions later.

Parameters:

Name Type Description Default
model RandomForestClassifier

The RandomForestClassifier model to be saved.

required
dir_path str

The path to the directory where the model file will be saved.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def save_model(model: RandomForestClassifier, dir_path: str):
    """
    Save a RandomForestClassifier model to a file.

    This function saves the provided RandomForestClassifier model to a \
    file in the specified
    directory. The saved model can be loaded and used for predictions later.

    Args:
        model (RandomForestClassifier): The RandomForestClassifier model to be saved.
        dir_path (str): The path to the directory where the model file will be saved.

    Returns:
        Message class
    """

    path_dir = str(Path(dir_path).parent) + '/classify/model'
    path = Path(path_dir)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)

    file_name = path_dir + '/model.sav'
    with open(file_name, 'wb') as model_file:
        pickle.dump(model, model_file)
    return message.info_model_saved(path_dir)

save_predict_data(id_values, predicted_data, dir_path)

Save predicted data to a CSV file.

This function saves the provided 'ID' values and predicted data as a CSV file in the specified directory. It returns a message confirming the successful saving of the predictions.

Parameters:

Name Type Description Default
id_values Series

A pandas Series containing 'ID' values.

required
predicted_data Series

A pandas Series containing predicted data.

required
dir_path str

The path to the directory for saving the CSV file.

required

Returns:

Name Type Description
str str

A message confirming the successful saving of the predictions.

Source code in python/gramep/data_io.py
def save_predict_data(
    id_values: pd.Series, predicted_data: pd.Series, dir_path: str
) -> str:
    """
    Save predicted data to a CSV file.

    This function saves the provided 'ID' values and predicted data as a CSV \
    file in the specified
    directory. It returns a message confirming the successful saving of the predictions.

    Args:
        id_values (pd.Series): A pandas Series containing 'ID' values.
        predicted_data (pd.Series): A pandas Series containing predicted data.
        dir_path (str): The path to the directory for saving the CSV file.

    Returns:
        str: A message confirming the successful saving of the predictions.
    """

    pd.DataFrame({'ID': id_values, 'Predicted': predicted_data}).to_csv(
        dir_path + '/predict_data.csv', encoding='utf-8', index=False
    )
    message.info_done()
    return message.info_predictions_saved(dir_path)
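
With `index=False`, the resulting file should contain a header row followed by one row per sequence and no index column. The sketch below reproduces that layout with the standard library `csv` module instead of pandas; the IDs and predictions are made up.

```python
import csv
import io

ids = ['seq1', 'seq2']
predictions = ['variantA', 'variantB']

# Write to an in-memory buffer so the example leaves no files behind.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['ID', 'Predicted'])
writer.writerows(zip(ids, predictions))
csv_text = buffer.getvalue()
```

Unlike the other save functions in this module, save_predict_data writes directly into `dir_path` and does not create it, so the directory has to exist beforehand.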

save_ranges(ranges, dir_path)

Save MinMaxScaler object ranges to a file.

This function saves the MinMaxScaler object's data range information to a file in the specified directory. The saved data can be used for scaling operations later.

Parameters:

Name Type Description Default
ranges MinMaxScaler

The MinMaxScaler object containing data range information.

required
dir_path str

The path to the directory where the data range file will be saved.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def save_ranges(ranges: MinMaxScaler, dir_path: str):
    """
    Save MinMaxScaler object ranges to a file.

    This function saves the MinMaxScaler object's data range information to a file
    in the specified directory. The saved data can be used for scaling operations later.

    Args:
        ranges (MinMaxScaler): The MinMaxScaler object containing data \
        range information.
        dir_path (str): The path to the directory where the data range \
        file will be saved.

    Returns:
        Message class
    """

    path_dir = str(Path(dir_path).parent) + '/classify/model'
    path = Path(path_dir)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)

    file_name = path_dir + '/ranges.sav'
    with open(file_name, 'wb') as ranges_file:
        pickle.dump(ranges, ranges_file)
    return message.info_ranges_saved(path_dir)

write_frequencies(freq_kmers, sequence_path, save_path)

Write k-mer frequencies to a file.

This function takes a defaultdict containing k-mer frequencies, a sequence data file path, and a save path. It writes the k-mer frequencies to the specified file.

Parameters:

Name Type Description Default
freq_kmers defaultdict

A defaultdict containing k-mer frequencies.

required
sequence_path str

The path to the sequence data file.

required
save_path str

The path to save the k-mer frequencies.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def write_frequencies(
    freq_kmers: defaultdict, sequence_path: str, save_path: str
):
    """
    Write k-mer frequencies to a file.

    This function takes a defaultdict containing k-mer frequencies, a sequence \
    data file path,
    and a save path. It writes the k-mer frequencies to the specified file.

    Args:
        freq_kmers (defaultdict): A defaultdict containing k-mer frequencies.
        sequence_path (str): The path to the sequence data file.
        save_path (str): The path to save the k-mer frequencies.

    Returns:
        Message class
    """

    seq_name = sequence_path.split('/')[-1].split('.')[0]
    # Check if path exists
    dirPath = save_path + '/' + seq_name
    path = Path(dirPath)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)
    save_file_name = dirPath + '/' + seq_name + '_FreqExclusiveKmers.csv'

    with open(save_file_name, 'a') as out_file:
        out_file.write('position;reference_value;variant_value;frequency\n')
        for key, value in freq_kmers.items():
            position, var = key.split(':')
            out_file.write(
                str(
                    str(position)
                    + ';'
                    + str(var[0])
                    + ';'
                    + str(var[1])
                    + ';'
                    + str(value)
                    + '\n'
                )
            )

    return message.info_freq_saved(dirPath)
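
The keys of `freq_kmers` are expected to look like `position:XY`, where the first character after the colon is written as the reference value and the second as the variant value. The sketch below builds the same semicolon-separated rows in memory; `frequency_rows` is a hypothetical helper and the frequencies are made up.

```python
def frequency_rows(freq_kmers: dict[str, int]) -> list[str]:
    """Hypothetical helper: build the CSV rows from 'position:XY' keys."""
    rows = ['position;reference_value;variant_value;frequency\n']
    for key, value in freq_kmers.items():
        position, var = key.split(':')
        # var[0] is written as the reference value, var[1] as the variant value.
        rows.append(f'{position};{var[0]};{var[1]};{value}\n')
    return rows


rows = frequency_rows({'100:AT': 12, '2345:GC': 3})
```

Because write_frequencies opens the file in append mode, running it twice for the same sequence adds a second header and a second set of rows to the existing file.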

write_report(report, sequence_path, save_path)

Write a report to a file.

This function takes a report as a NumPy array, a sequence data file path, and a save path. It writes the report to the specified file.

Parameters:

Name Type Description Default
report ndarray

A NumPy array representing the report.

required
sequence_path str

The path to the sequence data file.

required
save_path str

The path to save the report.

required

Returns:

Type Description

Message class

Source code in python/gramep/data_io.py
def write_report(report: np.ndarray, sequence_path: str, save_path: str):
    """
    Write a report to a file.

    This function takes a report as a NumPy array, a sequence data file path,
    and a save path. It writes the report to the specified file.

    Args:
        report (np.ndarray): A NumPy array representing the report.
        sequence_path (str): The path to the sequence data file.
        save_path (str): The path to save the report.

    Returns:
        Message class
    """
    seq_name = sequence_path.split('/')[-1].split('.')[0]
    # Check if path exists
    dirPath = save_path + '/' + seq_name
    path = Path(dirPath)
    path.mkdir(mode=0o777, parents=True, exist_ok=True)
    save_file_name = dirPath + '/' + seq_name + '_report.csv'

    with open(save_file_name, 'a') as out_file:
        out_file.write(
            'sequence_id;annotation_name;start;end;type;modification_localization_in_reference;reference_kmer;exclusive_variant_kmer;reference_snp;variant_snp\n'
        )
    pd.Series(list(report)).to_csv(
        save_file_name, header=False, index=False, mode='a'
    )

    return message.info_report_saved(dirPath)