I started with Sheikh AbdurRashid Sufi’s recitation in Hafs from the Qatar Quran Broadcast channel, located here.
download using scdl:
scdl -l https://soundcloud.com/abdulrashidsufi/sets/hafs2023
the files are named like:
الشيخ عبد الرشيد شيخ علي عبد الرحمن صوفي - المصحف المرتل برواية حفص عن عاصم 2023_من أول سورة الحجر إلى آخر سورة النحل - رواية حفص عن عاصم.mp3
rename the files using aichat + Claude 3.5 Sonnet:
ls *.mp3 > output
aichat -r shell -f output "Given this list of files, rename them into English describing them - example sura_tawba_5_through_93.mp3. Ideally, order them also in Quran order, meaning make the first file 01_sura_fatiha_to_baqarah_141.mp3"
convert to wav using ffmpeg:
# ${i:r} is the zsh modifier that strips the extension
for i in *.mp3; do
  ffmpeg -i "$i" -ar 16000 -ac 1 -c:a pcm_s16le "wav/${i:r}.wav"
done
transcribe using whisper-cpp:
for i in ~/Desktop/sheikh_abdurrashid_sufi_hafs/wav/*.wav; do
  ./main -m ~/Documents/whisper/ggml-large-v2.bin -l ar -ocsv -of "${i:h}/transcript/${i:t}" "$i"
done
Note that the transcription quality is good, but in some cases the timestamps are off by a second or so in either direction, requiring some manual adjustments.
The intention of this step is to generate, based on the data, a json file containing:
[
  { "sura": <number>, "start_ayah": <int>, "end_ayah": <int>, "start": <time>, "end": <time> },
  // ...
]
We'd have multiple rows for suras that span multiple files.
I did the "actual" approach in the section below, but it had multiple problems that had to be resolved along the way. In retrospect (and/or if I need to do this again at some point), I'd likely use the following approach:
Generate a script that walks through the files searching for "بسم"; in conjunction with the filename, that's enough information to generate each row. For example, we start at sura 1, and the next basmallah we encounter marks the start of sura 2. When one file ends and we enter the next, we see from the file name (02_sura_baqarah_142_to_252.wav.csv) that it covers ayah 142 through 252, so we can update the rows accordingly.
The caveats are:
- sura Tawbah has no basmallah (search for the isti3atha instead)
- sura Naml contains a basmallah mid-sura (27:30) that isn't a sura separator
Also keep in mind that if a sura spans multiple files, the n > 1 file pieces do not begin with a basmallah.
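The idea above can be sketched roughly as follows. This is a hypothetical helper (`split_at_basmallah` is my name, not anything I actually ran), assuming the whisper-cpp CSV rows have already been parsed into (start_ms, end_ms, text) tuples; the Tawbah and Naml caveats are noted in comments but not handled:

```python
# Sketch of the basmallah-splitting idea. Caveats from above are NOT
# handled here: Tawbah has no basmallah (search for the isti3atha
# instead), and Naml has a mid-sura basmallah (27:30) that must not be
# treated as a separator.

BASMALLAH = "بسم الله"

def split_at_basmallah(rows, first_sura):
    """Return (sura, start_ms, end_ms) segments for one transcript file.

    first_sura comes from the file name, e.g. 02_sura_baqarah_142_to_252
    starts in sura 2.
    """
    segments = []
    sura = first_sura
    seg_start = None
    for start, end, text in rows:
        if BASMALLAH in text:
            if seg_start is not None:
                # a new basmallah closes the previous sura's segment
                segments.append((sura, seg_start, start))
                sura += 1
            seg_start = start
    if seg_start is not None:
        # last segment runs to the end of the file
        segments.append((sura, seg_start, rows[-1][1]))
    return segments

# tiny worked example: end of Fatiha, then the start of Baqarah
rows = [
    (22840, 27440, "أعوذ بالله من الشيطان الرجيم"),
    (27440, 31640, "بسم الله الرحمن الرحيم"),
    (31640, 66480, "الحمد لله رب العالمين ..."),
    (66480, 70920, "بسم الله الرحمن الرحيم"),
    (70920, 77600, "الم"),
]
print(split_at_basmallah(rows, first_sura=1))
# [(1, 27440, 66480), (2, 66480, 77600)]
```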
generate an aichat prompt as follows in ~/.config/aichat/roles/quran_ordered_audio.md:
I will give you an input file with a transcript of some verses from the recitation of the Quran. This file may have other audio information beforehand (e.g. an introduction, a copyright at the end, etc). I would like you to process the passed file and return one or more json blobs of data, representing:
{ "sura": <number>, "start_ayah": <int>, "end_ayah": <int>, "start": <time>, "end": <time> }
For example, if the transcript file passed in contains the following:
```
0,4560," هذه إذاعة القرآن الكريم من الدوحة"
4560,7240," برواية حفص عن عاصم"
7240,12360," يرتل علينا القارئ الشيخ عبد الرشيد الصوفي"
12360,14760," ما يتيسر له"
14760,17080," من سورة الفاتحة"
17080,22840," إلى الآية ١٠٤١ من سورة البقرة"
22840,27440," أعوذ بالله من الشيطان الرجيم"
27440,31640," بسم الله الرحمن الرحيم"
31640,36400," الحمد لله رب العالمين"
36400,39920," الرحمن الرحيم"
39920,43360," مالك يوم الدين"
43360,48720," إياك نعبد وإياك نستعين"
```
you would output:
```json
{ "sura": 1, "start_ayah": 1, "end_ayah": 7, "start": 27440, "end": 66480 },
{ "sura": 2, "start_ayah": 1, "end_ayah": 1, "start": 66480, "end": 77600 },
```
Please only output the json lines as above and nothing else, since I plan on chaining the output of your program with other programs.
note - in retrospect, it would have made my life slightly easier to include a file field in each row with the file name. using the approach above, we end up having to "guess" it instead based on the timing. I add the file in a later step, which you can skip if you include it here.
now run it in the shell, noting that this costs ~$1.53 and is a bit slow - see the section above for the "ideal approach":
for i in *.csv; do
  aichat -f "$i" -r quran_ordered_audio
  # sleep required to avoid hitting rate limits
  sleep 15
done
sadly, it still sometimes prefixes its output with "Here are the JSON blobs representing the Quran verses in the transcript:", and sometimes it pretty prints the json as well.
paste the output data into a single file and make it into a valid json file - timings.json. just some find and replace to change } to }, plus adding the [ and ] around the set of them.
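The find-and-replace cleanup could also be scripted. Here's a rough sketch (not what I actually did) that pulls each {...} blob out of the raw pasted output, tolerating the stray prose and pretty-printing; `to_json_array` is a hypothetical name:

```python
import json
import re

def to_json_array(raw_text):
    # grab every {...} blob (none of ours nest), ignoring surrounding
    # prose like "Here are the JSON blobs..." and trailing commas
    blobs = re.findall(r'\{[^{}]*\}', raw_text)
    return [json.loads(b) for b in blobs]

raw = '''Here are the JSON blobs representing the Quran verses in the transcript:
{ "sura": 1, "start_ayah": 1, "end_ayah": 7, "start": 27440, "end": 66480 },
{
  "sura": 2, "start_ayah": 1, "end_ayah": 1, "start": 66480, "end": 77600
}'''
rows = to_json_array(raw)
print(rows)

# then write the cleaned array out as timings.json
with open('timings.json', 'w') as f:
    json.dump(rows, f, indent=2)
```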
note - we can spot check some of the audio data using:
brew install mpv
ffmpeg -i 01_sura_fatiha_to_baqarah_141.mp3 -ss 66.48 -f wav - | mpv --no-video -
note - I’ve found some mistakes with the timings in juz’ 26 and juz’ 29, so it’s worth checking these first. It’s likely that the alternative approach of searching for basmallahs would be more accurate. I also found other mistakes as time went on (parts where it cut at a wrong place, etc). This is aside from actual transcription timestamps being a little bit off in some cases.
if you need to rename the files, you can use:
# with zsh
autoload zmv
zmv '([0-9]##)_*.mp3' '$1.mp3'
The script to add the filenames was generated using Claude Sonnet with this prompt:
I have a json file that looks like this:
```json
[
{ "sura": 1, "start_ayah": 1, "end_ayah": 7, "start": 27440, "end": 66480 },
{ "sura": 2, "start_ayah": 1, "end_ayah": 141, "start": 66480, "end": 2799920 },
{ "sura": 2, "start_ayah": 142, "end_ayah": 252, "start": 33560, "end": 2812420 },
{ "sura": 2, "start_ayah": 253, "end_ayah": 286, "start": 30600, "end": 1091500 },
{ "sura": 3, "start_ayah": 1, "end_ayah": 92, "start": 1091500, "end": 2766620 }
]
```
Note that start and end are in milliseconds.
I'd like to add a new field to each line with the filename. The filename should be something between 01.mp3 and 30.mp3, incrementally speaking. The trick is that there are more than 30 rows, however, and so the filename needs to be accurately chosen based on the data. For example, we'll start at 01.mp3.
When we reach the second row, we see that start is greater or equal to the end of the previous item, so we assume it remains the same filename. In the third line, we see that start is less than the end of the previous item, so we swap the file name. In the fourth file, this happens again, so yet another file name. In the 5th line, the same file name remains since the start is equal to the end of the previous line.
It output the following script:
import json

def assign_filenames(data):
    current_file_number = 1
    current_end = 0
    for item in data:
        if item['start'] < current_end:
            current_file_number += 1
        item['filename'] = f"{current_file_number:02d}.mp3"
        current_end = item['end']
    return data

# Read the JSON file
with open('input.json', 'r') as file:
    data = json.load(file)

# Assign filenames
updated_data = assign_filenames(data)

# Write the updated JSON to a new file
with open('output.json', 'w') as file:
    json.dump(updated_data, file, indent=2)

print("Processing complete. Check output.json for the result.")
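As a quick sanity check of the rule from the prompt, running the assignment function (re-declared here so the snippet is standalone) on the five sample rows should yield 01, 01, 02, 03, 03 - matching the hand-worked example above:

```python
# function re-declared from the generated script so this runs standalone
def assign_filenames(data):
    current_file_number = 1
    current_end = 0
    for item in data:
        if item['start'] < current_end:
            current_file_number += 1
        item['filename'] = f"{current_file_number:02d}.mp3"
        current_end = item['end']
    return data

# start/end pairs from the sample json in the prompt
sample = [
    {"sura": 1, "start": 27440, "end": 66480},
    {"sura": 2, "start": 66480, "end": 2799920},
    {"sura": 2, "start": 33560, "end": 2812420},
    {"sura": 2, "start": 30600, "end": 1091500},
    {"sura": 3, "start": 1091500, "end": 2766620},
]
print([row['filename'] for row in assign_filenames(sample)])
# ['01.mp3', '01.mp3', '02.mp3', '03.mp3', '03.mp3']
```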
Now that we have the source mp3s and the timings, we need a script to generate the per-sura mp3s. I used this prompt to Claude Sonnet:
I have a json file representing timings of suras in the Quran. The json file
contains entries, each of which looks like:
```json
[
  { "sura": 1, "start_ayah": 1, "end_ayah": 7, "start": 27440, "end": 66480, "filename": "01.mp3" },
  { "sura": 2, "start_ayah": 1, "end_ayah": 141, "start": 66480, "end": 2799920, "filename": "01.mp3" },
  { "sura": 2, "start_ayah": 142, "end_ayah": 252, "start": 33560, "end": 2812420, "filename": "02.mp3" },
  { "sura": 2, "start_ayah": 253, "end_ayah": 286, "start": 30600, "end": 1091500, "filename": "03.mp3" },
  { "sura": 3, "start_ayah": 1, "end_ayah": 92, "start": 1091500, "end": 2766620, "filename": "03.mp3" },
  // ...
]
```
The filenames here are in order from 01 to 30. A sura could span multiple rows. I'd like to generate 114 mp3s from this data, one for each sura. Please read the json and generate mp3s, respecting the "start" and "end" timings for each entry. For example, in the case above, we'd make a single mp3 for sura 1 using the timestamps 27440 and 66480.
The next 3 entries represent sura 2. We'd make a single mp3 for sura 2 using the timestamps 66480 and 2799920 from 01.mp3, 33560 through 2812420 from 02.mp3, and 30600 through 1091500 from 03.mp3. We'd concatenate these into one mp3, and name this output file as 002.mp3.
I spent a lot of back and forth after that, since the various ffmpeg commands with filters didn't work. The final approach that worked (with minor updates to change where the file got output) looked like this:
import json
import subprocess
import os
import shutil

# Read the JSON file
with open('timings.json', 'r') as f:
    data = json.load(f)

# Group entries by sura
suras = {}
for entry in data:
    sura = entry['sura']
    if sura not in suras:
        suras[sura] = []
    suras[sura].append(entry)

# Create a temporary directory
temp_dir = os.path.join(os.getcwd(), 'tmp')
os.makedirs(temp_dir, exist_ok=True)

# Process each sura
for sura, entries in suras.items():
    output_filename = f"{sura:03d}.mp3"
    print(f"Processing Sura {sura}...")
    segment_files = []
    for i, entry in enumerate(entries):
        start = entry['start'] / 1000  # Convert to seconds
        duration = (entry['end'] - entry['start']) / 1000  # Convert to seconds
        input_file = os.path.abspath(entry['filename'])
        segment_file = os.path.join(temp_dir, f'segment_{sura}_{i}.mp3')

        # Extract segment
        extract_command = [
            'ffmpeg',
            '-i', input_file,
            '-ss', str(start),
            '-t', str(duration),
            '-c', 'copy',
            segment_file
        ]
        try:
            subprocess.run(extract_command, check=True, capture_output=True)
            segment_files.append(segment_file)
        except subprocess.CalledProcessError as e:
            print(f"Error extracting segment for Sura {sura}, part {i+1}: {e}")
            print(f"Command was: {' '.join(extract_command)}")
            continue

    # Prepare concat file
    concat_file = os.path.join(temp_dir, f'concat_{sura}.txt')
    with open(concat_file, 'w') as f:
        for segment in segment_files:
            f.write(f"file '{segment}'\n")

    # Concatenate segments
    concat_command = [
        'ffmpeg',
        '-f', 'concat',
        '-safe', '0',
        '-i', concat_file,
        '-c', 'copy',
        os.path.join(temp_dir, output_filename)
    ]
    try:
        subprocess.run(concat_command, check=True, capture_output=True)
        print(f"Successfully created {output_filename}")
        # Move the output file to the output directory
        shutil.move(os.path.join(temp_dir, output_filename), os.path.join("output", output_filename))
    except subprocess.CalledProcessError as e:
        print(f"Error concatenating segments for Sura {sura}: {e}")
        print(f"Command was: {' '.join(concat_command)}")

    # Clean up segment files and concat file
    for segment in segment_files:
        os.remove(segment)
    os.remove(concat_file)

# Remove temporary directory
shutil.rmtree(temp_dir)
print("Processing complete.")
#quran/audio