
@derrick-daniel
Last active December 22, 2023 15:04
This script is tailored for the manipulation of the Comprehensive Antibiotic Resistance Database (CARD). It functions by ingesting data from a JSON file, selectively extracting pertinent details, and then reformatting them to align with the Abricate CARD DB structure. This process is essential for updating the Abricate CARD DB.
import json


def format_sequence(sequence):
    # Wrap the sequence into lines of at most 60 characters (FASTA style).
    return '\n'.join([sequence[i:i+60] for i in range(0, len(sequence), 60)])


def process_category_name(name):
    # Drop a trailing 'antibiotic', e.g. 'carbapenem antibiotic' -> 'carbapenem'.
    words = name.split()
    if words and words[-1].lower() == 'antibiotic':
        return ' '.join(words[:-1])
    return ' '.join(words)


def main():
    input_file = 'card-data/card.json'  # Path to the CARD JSON file
    output_file = 'sequences'  # Output file name

    with open(input_file, 'r') as file:
        card_data = json.load(file)

    with open(output_file, 'w') as out:
        for key, model in card_data.items():
            # Skip top-level entries that are not model records (e.g. plain-string metadata).
            if not isinstance(model, dict):
                continue
            model_name = model['model_name']
            # Entries without sequences, if any, produce no output.
            sequences = model.get('model_sequences', {}).get('sequence', {})
            for seq_key, seq_data in sequences.items():
                dna_seq = seq_data['dna_sequence']
                accession = dna_seq['accession'].split('.')[0]  # Strip the version suffix
                fmin = dna_seq['fmin']
                fmax = dna_seq['fmax']
                sequence = format_sequence(dna_seq['sequence'])
                # Keep only the categories classified as "Drug Class".
                drug_classes = ';'.join([process_category_name(model['ARO_category'][cat_key]['category_aro_name'])
                                         for cat_key in model['ARO_category']
                                         if model['ARO_category'][cat_key]['category_aro_class_name'] == "Drug Class"])
                description = model['ARO_description']
                # Header layout: >card~~~NAME~~~ACCESSION:FMIN-FMAX~~~CLASSES DESCRIPTION
                formatted_entry = f">card~~~{model_name}~~~{accession}:{fmin}-{fmax}~~~{drug_classes} {description}\n{sequence}\n"
                out.write(formatted_entry)


if __name__ == "__main__":
    main()
@derrick-daniel

This Python script processes the Comprehensive Antibiotic Resistance Database (CARD). It reads the CARD JSON file, extracts the relevant fields, and writes them as FASTA records whose headers follow the Abricate CARD DB layout.

Key Features:

  1. format_sequence function: Breaks DNA sequences into lines of at most 60 characters.
  2. process_category_name function: Removes a trailing 'antibiotic' from a category name, if present, to shorten the drug class names.
  3. Main processing loop: Iterates through the CARD data entries and writes one record per DNA sequence to the output file. Each record contains the model name, the accession (without its version suffix), the fmin and fmax coordinates, the drug classes, the ARO description, and the wrapped DNA sequence.
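
As a quick illustration of the two helpers, here is a minimal sketch that can be run on its own. The function bodies mirror the script; the `width` parameter and the input strings are additions made up for this example, not part of the original.

```python
def format_sequence(sequence, width=60):
    """Wrap a sequence string into lines of at most `width` characters."""
    return '\n'.join(sequence[i:i + width] for i in range(0, len(sequence), width))


def process_category_name(name):
    """Drop a trailing 'antibiotic' from a drug class name, if present."""
    words = name.split()
    if words and words[-1].lower() == 'antibiotic':
        return ' '.join(words[:-1])
    return ' '.join(words)


# Hypothetical inputs, chosen only to show the behaviour.
wrapped = format_sequence('ACGT' * 35)  # 140 bases in total
line_lengths = [len(line) for line in wrapped.split('\n')]

short_name = process_category_name('carbapenem antibiotic')
unchanged = process_category_name('penam')

print(line_lengths)  # two full lines and one shorter remainder
print(short_name, unchanged)
```

Only a trailing 'antibiotic' is removed, so names that do not end with that word pass through unchanged.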

The script requires a JSON file of the CARD database as input. The updated CARD database can be downloaded from the official CARD website or the following link:
Download Updated CARD Database
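
The script reads only a handful of fields from each entry in the JSON file. The sketch below shows the shape it assumes, inferred from the lookups in the script rather than from CARD documentation. Every value in the sample record is invented for illustration, and `drug_classes` is a hypothetical helper name.

```python
import json

# Hypothetical, heavily trimmed record: only the fields the script reads.
sample = json.loads('''
{
  "1": {
    "model_name": "exampleGene-1",
    "ARO_description": "Example description.",
    "ARO_category": {
      "100": {"category_aro_name": "carbapenem",
              "category_aro_class_name": "Drug Class"},
      "200": {"category_aro_name": "antibiotic inactivation",
              "category_aro_class_name": "Resistance Mechanism"}
    },
    "model_sequences": {
      "sequence": {
        "10": {"dna_sequence": {"accession": "XX000000.1",
                                "fmin": "0",
                                "fmax": "12",
                                "sequence": "ATGAAACCCGGG"}}
      }
    }
  }
}
''')


def drug_classes(model):
    """Return the names of categories whose class is 'Drug Class'."""
    return [cat['category_aro_name']
            for cat in model['ARO_category'].values()
            if cat['category_aro_class_name'] == 'Drug Class']


model = sample['1']
classes = drug_classes(model)
dna = next(iter(model['model_sequences']['sequence'].values()))['dna_sequence']
accession = dna['accession'].split('.')[0]  # drop the version suffix

print(classes, accession, dna['fmin'], dna['fmax'])
```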

To run the script, place the downloaded CARD JSON file in the specified directory (card-data/card.json by default) or adjust the input_file path in the script. The output is written to a plain-text file named sequences, as set by output_file.
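
One way to sanity-check the result is to split a header line back into its parts. This is a minimal sketch that assumes the header layout produced by the script (`>card~~~NAME~~~ACCESSION:FMIN-FMAX~~~CLASSES DESCRIPTION`). The `parse_header` function and the sample header are hypothetical and not part of the original script.

```python
def parse_header(header):
    """Split one header line written by the script into its fields."""
    db, name, location, rest = header.lstrip('>').rstrip('\n').split('~~~', 3)
    accession, _, coords = location.partition(':')
    fmin, _, fmax = coords.partition('-')
    # The drug classes end at the first space; the rest is the description.
    classes, _, description = rest.partition(' ')
    return {
        'db': db,
        'name': name,
        'accession': accession,
        'fmin': int(fmin),
        'fmax': int(fmax),
        'drug_classes': classes.split(';') if classes else [],
        'description': description,
    }


# Hypothetical header, made up for illustration.
sample_header = '>card~~~exampleGene-1~~~XX000000:0-12~~~carbapenem;penam Example description.\n'
record = parse_header(sample_header)
print(record)
```

Because the drug classes and the description are separated by a single space, a class name that itself contains a space cannot be told apart from the start of the description with this approach.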
