Last active
December 22, 2023 15:04
This script converts the Comprehensive Antibiotic Resistance Database (CARD) into the format used by the Abricate CARD DB. It reads the CARD JSON file, extracts the relevant fields, and reformats them to match the Abricate CARD DB structure, which is the step needed to update the Abricate CARD DB.
import json


def format_sequence(sequence):
    # Wrap the sequence into lines of at most 60 characters.
    return '\n'.join([sequence[i:i+60] for i in range(0, len(sequence), 60)])


def process_category_name(name):
    # Drop a trailing "antibiotic" (any case) from a category name.
    words = name.split()
    if words and words[-1].lower() == 'antibiotic':
        return ' '.join(words[:-1])
    return ' '.join(words)


def main():
    input_file = 'card-data/card.json'  # Path to the CARD JSON file
    output_file = 'sequences'  # Output file name

    with open(input_file, 'r') as file:
        card_data = json.load(file)

    with open(output_file, 'w') as out:
        for key, model in card_data.items():
            model_name = model['model_name']
            for seq_key, seq_data in model['model_sequences']['sequence'].items():
                dna_seq = seq_data['dna_sequence']
                # Keep the accession without its version suffix.
                accession = dna_seq['accession'].split('.')[0]
                fmin = dna_seq['fmin']
                fmax = dna_seq['fmax']
                sequence = format_sequence(dna_seq['sequence'])
                # Join the names of all categories of class "Drug Class" with ';'.
                drug_classes = ';'.join([
                    process_category_name(model['ARO_category'][cat_key]['category_aro_name'])
                    for cat_key in model['ARO_category']
                    if model['ARO_category'][cat_key]['category_aro_class_name'] == "Drug Class"
                ])
                description = model['ARO_description']
                formatted_entry = f">card~~~{model_name}~~~{accession}:{fmin}-{fmax}~~~{drug_classes} {description}\n{sequence}\n"
                out.write(formatted_entry)


if __name__ == "__main__":
    main()
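The main loop indexes every top-level entry as if it were a model record with sequence data. If the input also holds other entries, the run would stop with an error. The sketch below shows one possible guard. The helper name `iter_sequence_models` and the miniature input are hypothetical and not part of the original script, and whether a given CARD release needs the guard is an assumption.

```python
def iter_sequence_models(card_data):
    """Yield (key, model) pairs that carry a 'model_sequences' -> 'sequence' mapping.

    Assumption: any other top-level entry (for example a plain string)
    should be skipped silently rather than stop the run.
    """
    for key, model in card_data.items():
        if not isinstance(model, dict):
            continue  # not a model record
        sequences = model.get('model_sequences')
        if not isinstance(sequences, dict) or 'sequence' not in sequences:
            continue  # model record without sequence data
        yield key, model


# Hypothetical miniature input; every value here is made up for illustration.
example = {
    '_note': 'a top-level string entry',
    '1': {'model_name': 'geneA', 'model_sequences': {'sequence': {}}},
    '2': {'model_name': 'geneB'},
}

kept = [model['model_name'] for _, model in iter_sequence_models(example)]
print(kept)  # ['geneA']
```

In `main()`, the outer loop could then iterate over `iter_sequence_models(card_data)` instead of `card_data.items()`.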
This Python script processes the Comprehensive Antibiotic Resistance Database (CARD). It reads the CARD JSON file, extracts each model's name, accession, coordinates, drug classes, and description, and writes them out as sequence entries with Abricate-style headers.
Key Features:
- format_sequence function: Breaks DNA sequences into readable 60-character lines.
- process_category_name function: Removes the word 'antibiotic' from the end of a category name, if present, to shorten the drug class names.

The script requires a JSON file of the CARD database as input. The updated CARD database can be downloaded from the official CARD website or the following link:
Download Updated CARD Database
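The two helper functions listed under Key Features can be tried on their own before running the whole script. The snippet below restates them so it runs standalone. The `if words` guard against empty names is an addition, and the inputs are illustrative strings rather than values read from CARD.

```python
def format_sequence(sequence):
    # Wrap the sequence into lines of at most 60 characters.
    return '\n'.join(sequence[i:i+60] for i in range(0, len(sequence), 60))


def process_category_name(name):
    # Drop a trailing "antibiotic" (any case); leave other names untouched.
    words = name.split()
    if words and words[-1].lower() == 'antibiotic':
        return ' '.join(words[:-1])
    return ' '.join(words)


wrapped = format_sequence('ACGT' * 35)  # 140 bases
print([len(line) for line in wrapped.split('\n')])  # [60, 60, 20]

print(process_category_name('tetracycline antibiotic'))     # tetracycline
print(process_category_name('Fluoroquinolone Antibiotic'))  # Fluoroquinolone
print(process_category_name('disinfecting agents and antiseptics'))  # unchanged
```

Only a whole final word equal to 'antibiotic' is removed, so a name ending in 'antibiotics' or containing the word elsewhere passes through as is.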
To run the script, place the downloaded CARD JSON file in the specified directory and adjust the input_file path in the script accordingly. The output is written to a file named sequences, with one header line per sequence followed by the sequence wrapped at 60 characters per line.
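After a run, a quick sanity check is to confirm that every header line splits into the four `~~~`-separated fields the script writes (database tag, model name, accession with coordinates, and drug classes plus description). The sketch below is one way to do that. The function name `find_malformed_headers` and the sample lines are hypothetical.

```python
def find_malformed_headers(lines):
    """Return header lines that do not split into four non-empty '~~~' fields."""
    bad = []
    for line in lines:
        if not line.startswith('>'):
            continue  # sequence line, not a header
        fields = line[1:].rstrip('\n').split('~~~')
        if len(fields) != 4 or not all(fields):
            bad.append(line.rstrip('\n'))
    return bad


# Hypothetical sample lines; the names and accession are made up.
sample = [
    '>card~~~geneA~~~AB000001:10-910~~~tetracycline Hypothetical description.\n',
    'ACGTACGT\n',
    '>card~~~geneB~~~broken header\n',
]
print(find_malformed_headers(sample))  # ['>card~~~geneB~~~broken header']
```

To check real output, pass the open file object instead of the sample list, for example `find_malformed_headers(open('sequences'))`. The check does not separate drug classes from the description, because both share the last field and a class name may itself contain spaces.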