@efemaer
Last active February 17, 2025 15:33
Kokoro v1 Benchmark (PyTorch/ONNX, CPU/GPU)

Kokoro-82M-v1.0 Performance Benchmark

Introduction

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Source: https://huggingface.co/hexgrad/Kokoro-82M

We conducted a series of experiments to benchmark the performance of the Kokoro model across various runtimes and hardware configurations. Our primary motivation is to identify the most efficient and cost-effective setup for running the model.

Methodology

We evaluate the performance of the Kokoro PyTorch model (https://huggingface.co/hexgrad/Kokoro-82M) and its ONNX counterpart (https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX). The experiments are categorized into CPU-only and GPU-accelerated setups.

Hardware and Environment

The benchmarking is conducted on AWS EC2 instances. Three types of GPUs are chosen for performance comparison: the NVIDIA T4, A10G, and L4 Tensor Core GPUs. A compute-optimized CPU instance at a similar price point is chosen for a fair cost comparison.

Instance Type vCPUs CPU Model Memory GPU AMI
g4dn.xlarge 4 Intel Xeon 16 GiB NVIDIA T4 AWS DL AMI GPU PyTorch 2.5 (Ubuntu 22.04)
g5.xlarge 4 AMD EPYC 7R32 16 GiB NVIDIA A10G AWS DL AMI GPU PyTorch 2.5 (Ubuntu 22.04)
g6.xlarge 4 AMD EPYC 7R32 16 GiB NVIDIA L4 AWS DL AMI GPU PyTorch 2.5 (Ubuntu 22.04)
c6a.8xlarge 32 AMD EPYC 7R32 64 GiB N/A Ubuntu 22.04 LTS

Evaluation Metrics

Real-Time Factor (RTF) is used as the main evaluation metric for the experiments. It is defined as the ratio of the duration of the resulting audio to the time taken to produce it, indicating processing efficiency.
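
In code, the metric reduces to a single division (a minimal illustration; the numbers in the example call are hypothetical):

def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    # RTF > 1 means audio is produced faster than real time
    return audio_seconds / processing_seconds

print(real_time_factor(60.0, 2.0))  # 60 s of audio generated in 2 s of compute -> 30.0, i.e. 30x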

Experimental Setup

Installation Procedure

We first install the required dependencies for the experiments. Two separate procedures are needed: GPU instances additionally require the CUDA libraries and the onnxruntime-gpu package, while CPU instances use the standard onnxruntime package.

GPU Instances

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y espeak-ng git-lfs cudnn9-cuda-12

source activate pytorch
pip install torch kokoro onnxruntime-gpu soundfile

git clone https://huggingface.co/hexgrad/Kokoro-82M
cd Kokoro-82M
wget https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX/resolve/main/onnx/model.onnx

CPU Instances

sudo apt-get update
sudo apt-get install -y espeak-ng git-lfs python3-pip

pip install torch kokoro onnxruntime soundfile

git clone https://huggingface.co/hexgrad/Kokoro-82M
cd Kokoro-82M
wget https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX/resolve/main/onnx/model.onnx

Tweaks to the Model Inputs

To ensure a fair comparison, the PyTorch model was modified so that its input types exactly match those of the ONNX model. The forward method was updated to expect int64 input IDs, so the phonemes are converted into input IDs using the provided vocabulary before being passed to the model.

--- a/kokoro/model.py
+++ b/kokoro/model.py
@@ -74,15 +74,11 @@ class KModel(torch.nn.Module):
     @torch.no_grad()
     def forward(
         self,
-        phonemes: str,
+        input_ids: torch.LongTensor,
         ref_s: torch.FloatTensor,
         speed: Number = 1,
-        return_output: bool = False # MARK: BACKWARD COMPAT
-    ) -> Union['KModel.Output', torch.FloatTensor]:
-        input_ids = list(filter(lambda i: i is not None, map(lambda p: self.vocab.get(p), phonemes)))
-        logger.debug(f"phonemes: {phonemes} -> input_ids: {input_ids}")
-        assert len(input_ids)+2 <= self.context_length, (len(input_ids)+2, self.context_length)
-        input_ids = torch.LongTensor([[0, *input_ids, 0]]).to(self.device)
+    ) -> torch.FloatTensor:
+        input_ids = input_ids.to(self.device)
         input_lengths = torch.LongTensor([input_ids.shape[-1]]).to(self.device)
         text_mask = torch.arange(input_lengths.max()).unsqueeze(0).expand(input_lengths.shape[0], -1).type_as(input_lengths)
         text_mask = torch.gt(text_mask+1, input_lengths.unsqueeze(1)).to(self.device)
@@ -105,4 +101,4 @@ class KModel(torch.nn.Module):
         t_en = self.text_encoder(input_ids, input_lengths, text_mask)
         asr = t_en @ pred_aln_trg
         audio = self.decoder(asr, F0_pred, N_pred, ref_s[:, :128]).squeeze().cpu()
-        return self.Output(audio=audio, pred_dur=pred_dur.cpu()) if return_output else audio
+        return audio
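
With this change the caller performs the phoneme-to-ID conversion that forward() previously did internally. A minimal sketch of the new calling convention, mirroring the benchmarking code below (the names vocab, phonemes, model, and ref_s are assumed to be defined as in that code):

import torch

phoneme_ids = [vocab[p] for p in phonemes if p in vocab]  # drop symbols missing from the vocabulary
input_ids = torch.LongTensor([[0, *phoneme_ids, 0]])      # add the boundary token (0) on both sides
audio = model(input_ids, ref_s)                           # the modified forward() accepts int64 input IDs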

Inference Data

We use an English adaptation of Who Am I?: The Teachings of Bhagavan Sri Ramana Maharshi, containing approximately 3,000 words (about 16,000 characters).

Tokenization

The G2P (grapheme-to-phoneme) library misaki is used to transcribe the English text into phonemes, which are then converted into integer tokens (i.e. input IDs) using the provided vocabulary.
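
In isolation, this is a two-step mapping (a small sketch using a hypothetical input string; the benchmarking code below applies the same steps to the full text):

from misaki import en, espeak
import json

fallback = espeak.EspeakFallback(british=False)
g2p = en.G2P(trf=False, british=False, fallback=fallback)

phonemes, _ = g2p('Hello there!')                    # text -> phoneme string
vocab = json.load(open('config.json'))['vocab']      # phoneme symbol -> integer ID
input_ids = [vocab[p] for p in phonemes if p in vocab]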

Chunking

The model has a hard limit on input length of 510 tokens. To process long sequences of input phonemes efficiently, we implemented a chunking methodology that divides the input into manageable segments based on punctuation. First, the input sequence is split into smaller chunks at sentence-final delimiters: periods, question marks, and exclamation marks (., ?, and !). These initial chunks are then merged greedily, combining consecutive chunks as long as the combined length does not exceed the 510-token limit. This keeps the input sequences close to the optimal length for inference. Other splitting strategies might be more semantically faithful depending on context, but since we are interested in performance benchmarking, this simple approach is sufficient and likely yields better throughput.
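
The same logic, written as a self-contained function equivalent to the loops in the benchmarking code below (the 510-token limit leaves room for the two boundary tokens added around each chunk):

def chunk_tokens(tokens, delimiters, max_len=510):
    # split on sentence-final punctuation
    sentences, current = [], []
    for t in tokens:
        current.append(t)
        if t in delimiters:
            sentences.append(current)
            current = []
    if current:  # keep any trailing tokens without a final delimiter
        sentences.append(current)

    # greedily merge consecutive sentences while staying under the limit
    merged = []
    for sentence in sentences:
        if merged and len(merged[-1]) + len(sentence) <= max_len:
            merged[-1] = merged[-1] + sentence
        else:
            merged.append(sentence)
    return merged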

Measurement

For each setup, we run 5 iterations of audio generation and measure the time it takes to run the input through the model. The first iteration is discarded as warm-up/cold start, and the average is taken over the 4 remaining runs. Tokenization and chunking are not included in the measurement, as they are identical in all setups and we are only interested in the model's inference performance.
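
The per-round numbers written to report.csv by the script below can be aggregated with a few lines (a sketch; the column order matches the writer.writerow calls: runtime, device, round, duration, audio length, RTF):

import csv
from collections import defaultdict

rtfs = defaultdict(list)
with open('report.csv') as f:
    for runtime, device, rnd, duration, audio_len, rtf in csv.reader(f):
        if int(rnd) > 1:  # discard the warm-up round
            rtfs[(runtime, device)].append(float(rtf))

for (runtime, device), values in rtfs.items():
    print(f'{runtime} on {device}: average RTF = {sum(values) / len(values):.1f}x')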

Benchmarking Code

from misaki import en, espeak
from kokoro import KModel

import re
import json
import time
import torch

import soundfile as sf

import numpy as np
import onnxruntime as ort

import csv

text = """
1. Who am I ?
The gross body which is composed of the seven humours (dhatus), I am not; the five cognitive sense organs, viz. the senses of hearing, touch, sight, taste, and smell, which apprehend their respective objects, viz. sound, touch, colour, taste, and odour, I am not; the five cognitive sense- organs, viz. the organs of speech, locomotion, grasping, excretion, and procreation, which have as their respective functions speaking, moving, grasping, excreting, and enjoying, I am not; the five vital airs, prana, etc., which perform respectively the five functions of in-breathing, etc., I am not; even the mind which thinks, I am not; the nescience too, which is endowed only with the residual impressions of objects, and in which there are no objects and no functioning's, I am not.

2. If I am none of these, then who am I?
After negating all of the above-mentioned as 'not this', 'not this', that Awareness which alone remains - that I am.

3. What is the nature of Awareness?
The nature of Awareness is existence-consciousness-bliss

4. When will the realization of the Self be gained?
When the world which is what-is-seen has been removed, there will be realization of the Self which is the seer.

5. Will there not be realization of the Self even while the world is there (taken as real)?
There will not be.

6. Why?
The seer and the object seen are like the rope and the snake. Just as the knowledge of the rope which is the substrate will not arise unless the false knowledge of the illusory serpent goes, so the realization of the Self which is the substrate will not be gained unless the belief that the world is real is removed.

7. When will the world which is the object seen be removed?
When the mind, which is the cause of all cognition's and of all actions, becomes quiescent, the world will disappear.

8. What is the nature of the mind?
What is called 'mind' is a wondrous power residing in the Self. It causes all thoughts to arise. Apart from thoughts, there is no such thing as mind. Therefore, thought is the nature of mind. Apart from thoughts, there is no independent entity called the world. In deep sleep there are no thoughts, and there is no world. In the states of waking and dream, there are thoughts, and there is a world also. Just as the spider emits the thread (of the web) out of itself and again withdraws it into itself, likewise the mind projects the world out of itself and again resolves it into itself. When the mind comes out of the Self, the world appears. Therefore, when the world appears (to be real), the Self does not appear; and when the Self appears (shines) the world does not appear. When one persistently inquires into the nature of the mind, the mind will end leaving the Self (as the residue). What is referred to as the Self is the Atman. The mind always exists only in dependence on something gross; it cannot stay alone. It is the mind that is called the subtle body or the soul (jiva).

9. What is the path of inquiry for understanding the nature of the mind?
That which rises as 'I' in this body is the mind. If one inquires as to where in the body the thought 'I' rises first, one would discover that it rises in the heart. That is the place of the mind's origin. Even if one thinks constantly 'I' 'I', one will be led to that place. Of all the thoughts that arise in the mind, the 'I' thought is the first. It is only after the rise of this that the other thoughts arise. It is after the appearance of the first personal pronoun that the second and third personal pronouns appear; without the first personal pronoun there will not be the second and third.

10. How will the mind become quiescent?
By the inquiry 'Who am I?'. The thought 'who am I?' will destroy all other thoughts, and like the stick used for stirring the burning pyre, it will itself in the end get destroyed. Then, there will arise Self-realization.

11. What is the means for constantly holding on to the thought 'Who am I?'
When other thoughts arise, one should not pursue them, but should inquire: 'To whom do they arise?' It does not matter how many thoughts arise. As each thought arises, one should inquire with diligence, “To whom has this thought arisen?”. The answer that would emerge would be “To me”. Thereupon if one inquires “Who am I?”, the mind will go back to its source; and the thought that arose will become quiescent. With repeated practice in this manner, the mind will develop the skill to stay in its source. When the mind that is subtle goes out through the brain and the sense- organs, the gross names and forms appear; when it stays in the heart, the names and forms disappear. Not letting the mind go out, but retaining it in the Heart is what is called “inwardness” (antar- mukha). Letting the mind go out of the Heart is known as “externalisation” (bahir-mukha). Thus, when the mind stays in the Heart, the 'I' which is the source of all thoughts will go, and the Self which ever exists will shine. Whatever one does, one should do without the egoity “I”. If one acts in that way, all will appear as of the nature of Siva (God).

12. Are there no other means for making the mind quiescent?
Other than inquiry, there are no adequate means. If through other means it is sought to control the mind, the mind will appear to be controlled, but will again go forth. Through the control of breath also, the mind will become quiescent; but it will be quiescent only so long as the breath remains controlled, and when the breath resumes the mind also will again start moving and will wander as impelled by residual impressions. The source is the same for both mind and breath. Thought, indeed, is the nature of the mind. The thought “I” is the first thought of the mind; and that is egoity. It is from that whence egoity originates that breath also originates. Therefore, when the mind becomes quiescent, the breath is controlled, and when the breath is controlled the mind becomes quiescent. But in deep sleep, although the mind becomes quiescent, the breath does not stop. This is because of the will of God, so that the body may be preserved and other people may not be under the impression that it is dead. In the state of waking and in samadhi, when the mind becomes quiescent the breath is controlled. Breath is the gross form of mind. Till the time of death, the mind keeps breath in the body; and when the body dies the mind takes the breath along with it. Therefore, the exercise of breath-control is only an aid for rendering the mind quiescent (manonigraha); it will not destroy the mind (manonasa).

Like the practice of breath-control. meditation on the forms of God, repetition of mantras, restriction on food, etc., are but aids for rendering the mind quiescent.

Through meditation on the forms of God and through repetition of mantras, the mind becomes one- pointed. The mind will always be wandering. Just as when a chain is given to an elephant to hold in its trunk it will go along grasping the chain and nothing else, so also when the mind is occupied with a name or form it will grasp that alone. When the mind expands in the form of countless thoughts, each thought becomes weak; but as thoughts get resolved the mind becomes one-pointed and strong; for such a mind Self-inquiry will become easy. Of all the restrictive rules, that relating to the taking of sattvic food in moderate quantities is the best; by observing this rule, the sattvic quality of mind will increase, and that will be helpful to Self-inquiry.

13. The residual impressions (thoughts) of objects appear wending like the waves of an ocean. When will all of them get destroyed?
As the meditation on the Self rises higher and higher, the thoughts will get destroyed.

14. Is it possible for the residual impressions of objects that come from beginningless time, as it were, to be resolved, and for one to remain as the pure Self?
Without yielding to the doubt “Is it possible, or not?”, one should persistently hold on to the meditation on the Self. Even if one be a great sinner, one should not worry and weep “O! I am a sinner, how can I be saved?”; one should completely renounce the thought “I am a sinner”; and concentrate keenly on meditation on the Self; then, one would surely succeed. There are not two minds - one good and the other evil; the mind is only one. It is the residual impressions that are of two kinds - auspicious and inauspicious. When the mind is under the influence of auspicious impressions it is called good; and when it is under the influence of inauspicious impressions it is regarded as evil.

The mind should not be allowed to wander towards worldly objects and what concerns other people. However bad other people may be, one should bear no hatred for them. Both desire and hatred should be eschewed. All that one gives to others one gives to one's self. If this truth is understood who will not give to others? When one's self arises all arises; when one's self becomes quiescent all becomes quiescent. To the extent we behave with humility, to that extent there will result good. If the mind is rendered quiescent, one may live anywhere.

15. How long should inquiry be practised?
As long as there are impressions of objects in the mind, so long the inquiry “Who am I?” is required. As thoughts arise they should be destroyed then and there in the very place of their origin, through inquiry. If one resorts to contemplation of the Self unintermittently, until the Self is gained, that alone would do. As long as there are enemies within the fortress, they will continue to sally forth; if they are destroyed as they emerge, the fortress will fall into our hands.

16. What is the nature of the Self?
What exists in truth is the Self alone. The world, the individual soul, and God are appearances in it. like silver in mother-of-pearl, these three appear at the same time, and disappear at the same time. The Self is that where there is absolutely no “I” thought. That is called “Silence”. The Self itself is the world; the Self itself is “I”; the Self itself is God; all is Siva, the Self.

17. Is not everything the work of God?
Without desire, resolve, or effort, the sun rises; and in its mere presence, the sun-stone emits fire, the lotus blooms, water evaporates; people perform their various functions and then rest. Just as in the presence of the magnet the needle moves, it is by virtue of the mere presence of God that the souls governed by the three (cosmic) functions or the fivefold divine activity perform their actions and then rest, in accordance with their respective karmas. God has no resolve; no karma attaches itself to Him. That is like worldly actions not affecting the sun, or like the merits and demerits of the other four elements not affecting all pervading space.

18. Of the devotees, who is the greatest?
He who gives himself up to the Self that is God is the most excellent devotee. Giving one's self up to God means remaining constantly in the Self without giving room for the rise of any thoughts other than that of the Self. Whatever burdens are thrown on God, He bears them. Since the supreme power of God makes all things move, why should we, without submitting ourselves to it, constantly worry ourselves with thoughts as to what should be done and how, and what should not be done and how not? We know that the train carries all loads, so after getting on it why should we carry our small luggage on our head to our discomfort, instead of putting it down in the train and feeling at ease?

19. What is non-attachment?
As thoughts arise, destroying them utterly without any residue in the very place of their origin is non-attachment. Just as the pearl-diver ties a stone to his waist, sinks to the bottom of the sea and there takes the pearls, so each one of us should be endowed with non-attachment, dive within oneself and obtain the Self-Pearl.

20. Is it not possible for God and the Guru to effect the release of a soul?
God and the Guru will only show the way to release; they will not by themselves take the soul to the state of release. In truth, God and the Guru are not different. Just as the prey which has fallen into the jaws of a tiger has no escape, so those who have come within the ambit of the Guru's gracious look will be saved by the Guru and will not get lost; yet, each one should by his own effort pursue the path shown by God or Guru and gain release. One can know oneself only with one's own eye of knowledge, and not with somebody else's. Does he who is Rama require the help of a mirror to know that he is Rama?

21. Is it necessary for one who longs for release to inquire into the nature of categories (tattvas)?
Just as one who wants to throw away garbage has no need to analyse it and see what it is, so one who wants to know the Self has no need to count the number of categories or inquire into their characteristics; what he has to do is to reject altogether the categories that hide the Self. The world should be considered like a dream.

22. Is there no difference between waking and dream?
Waking is long and a dream short; other than this there is no difference. Just as waking happenings seem real while awake. so do those in a dream while dreaming. In dream the mind takes on another body. In both waking and dream states thoughts. names and forms occur simultaneously.

23. Is it any use reading books for those who long for release?
All the texts say that in order to gain release one should render the mind quiescent; therefore their conclusive teaching is that the mind should be rendered quiescent; once this has been understood there is no need for endless reading. In order to quieten the mind one has only to inquire within oneself what one's Self is; how could this search be done in books? One should know one's Self with one's own eye of wisdom. The Self is within the five sheaths; but books are outside them. Since the Self has to be inquired into by discarding the five sheaths, it is futile to search for it in books. There will come a time when one will have to forget all that one has learned.

24. What is happiness?
Happiness is the very nature of the Self; happiness and the Self are not different. There is no happiness in any object of the world. We imagine through our ignorance that we derive happiness from objects. When the mind goes out, it experiences misery. In truth, when its desires are fulfilled, it returns to its own place and enjoys the happiness that is the Self. Similarly, in the states of sleep, samadhi and fainting, and when the object desired is obtained or the object disliked is removed, the mind becomes inward-turned, and enjoys pure Self-Happiness. Thus the mind moves without rest alternately going out of the Self and returning to it. Under the tree the shade is pleasant; out in the open the heat is scorching. A person who has been going about in the sun feels cool when he reaches the shade. Someone who keeps on going from the shade into the sun and then back into the shade is a fool. A wise man stays permanently in the shade. Similarly, the mind of the one who knows the truth does not leave Brahman. The mind of the ignorant, on the contrary, revolves in the world, feeling miserable, and for a little time returns to Brahman to experience happiness. In fact, what is called the world is only thought. When the world disappears, i.e. when there is no thought, the mind experiences happiness; and when the world appears, it goes through misery.

25. What is wisdom-insight (jnana-drsti)?
Remaining quiet is what is called wisdom-insight. To remain quiet is to resolve the mind in the Self. Telepathy, knowing past, present and future happenings and clairvoyance do not constitute wisdom-insight.

26. What is the relation between desirelessness and wisdom?
Desirelessness is wisdom. The two are not different; they are the same. Desirelessness is refraining from turning the mind towards any object. Wisdom means the appearance of no object. In other words, not seeking what is other than the Self is detachment or desirelessness; not leaving the Self is wisdom.

27. What is the difference between inquiry and meditation?
Inquiry consists in retaining the mind in the Self. Meditation consists in thinking that one's self is Brahman, existence-consciousness-bliss.

28. What is release?
Inquiring into the nature of one's self that is in bondage, and realising one's true nature is release.
"""

# tokenization
fallback = espeak.EspeakFallback(british=False)
g2p = en.G2P(trf=False, british=False, fallback=fallback)
stripped_text = re.sub('\\s+', ' ', text).strip()
phonemes, _ = g2p(stripped_text)
vocab = json.load(open('config.json'))['vocab']
tokens = [vocab[p] for p in phonemes]

print(f'Total tokens: {len(tokens)}')

# generate chunks
delimiters = (vocab['.'], vocab['?'], vocab['!'])
chunks = []
s = []
for t in tokens:
    s.append(t)
    if t in delimiters:
        chunks.append(s)
        s = []
if s:  # keep any trailing tokens that do not end with a delimiter
    chunks.append(s)

# optimize chunks
i = 0
optimized_chunks = []
while i < len(chunks):  # iterate over all chunks so the final one is not dropped
    j = 1
    chunk_length = len(chunks[i])
    c = chunks[i]
    while i + j < len(chunks) and chunk_length + len(chunks[i + j]) <= 510:
        chunk_length += len(chunks[i + j])
        c += chunks[i + j]
        j += 1

    optimized_chunks.append(c)
    i += j

# set up
devices = ['cuda', 'cpu']  # adjust to the devices available on the instance (e.g. ['cpu'] on c6a.8xlarge)
rounds = 5

report = open('report.csv', 'a', newline='')
writer = csv.writer(report)

# ONNX
providers = {
    'cpu': 'CPUExecutionProvider',
    'cuda': 'CUDAExecutionProvider'
}
options = ort.SessionOptions()
options.intra_op_num_threads = 32
options.inter_op_num_threads = 32
voice = torch.load('voices/af_bella.pt').detach().cpu().numpy().astype(np.float32)
for device in devices:
    sess = ort.InferenceSession('model.onnx', providers=[providers[device]], sess_options=options)
    for r in range(rounds):
        print(f'Running ONNX on {device} / round {r+1}')
        audio_length = 0
        start_time = time.time()
        with sf.SoundFile(f'output_onnx_{device}_{r+1}.wav', mode='w', samplerate=24000, channels=1, format='WAV', subtype='PCM_16') as file:
            for chunk in optimized_chunks:
                style = voice[len(chunk)-1]  # style vector indexed by the chunk's token count
                input_ids = np.array([[0, *chunk, 0]], dtype=np.int64)  # add the boundary token (0) on both sides
                output = sess.run(None, dict(
                    input_ids=input_ids,
                    style=style,
                    speed=np.ones(1, dtype=np.float32),
                ))[0]
                audio_length += len(output[0]) / 24000
                file.write(output[0])
        end_time = time.time()
        duration = end_time - start_time
        writer.writerow(['onnx', device, r+1, duration, audio_length, audio_length/duration])
        report.flush()

# Torch
torch.set_num_threads(32)
torch.set_num_interop_threads(32)
# gradients are already disabled by the @torch.no_grad() decorator on KModel.forward
voice = torch.load('voices/af_bella.pt')
for device in devices:
    model_torch = KModel('config.json', 'kokoro-v1_0.pth').to(device).eval()
    voice = voice.to(device)
    for r in range(rounds):
        print(f'Running Torch on {device} / round {r+1}')
        audio_length = 0
        start_time = time.time()
        with sf.SoundFile(f'output_torch_{device}_{r+1}.wav', mode='w', samplerate=24000, channels=1, format='WAV', subtype='PCM_16') as file:
            for chunk in optimized_chunks:
                style = voice[len(chunk)-1]  # style vector indexed by the chunk's token count
                input_ids = torch.LongTensor([[0, *chunk, 0]])  # add the boundary token (0) on both sides
                output = model_torch(input_ids, style)
                audio_length += len(output) / 24000
                file.write(output)
        end_time = time.time()
        duration = end_time - start_time
        writer.writerow(['torch', device, r+1, duration, audio_length, audio_length/duration])
        report.flush()

report.close()
print('Done!')

Results

Reported RTF (length of resulting audio / processing time, higher is better), averaged over the 4 measured runs per setup (warm-up run excluded) and rounded.

Instance Type PyTorch CPU ONNX CPU PyTorch CUDA ONNX CUDA
g4dn.xlarge - - 36x 20x
g5.xlarge - - 96x 32x
g6.xlarge - - 81x 37x
c6a.8xlarge 5x 5x - -
@ubergarm

Hey, very nice! Thanks for making some nice and cleaner benchmarking code updated for Kokoro-82M-v1.0!

Glad to know the newer kokoro model is still as performant while having better quality! I appreciate your efforts!

@efemaer (Author) commented Feb 11, 2025

Thanks a lot @ubergarm ! You inspired me to do this, so thank you! :)
