Verifying that VOICEVOX Engine's cancellable_synthesis is non-blocking

1. Purpose and principle

This report verifies that, when VOICEVOX Engine is started with the options

--enable_cancellable_synthesis
--init_processes

it can synthesize a second audio clip without waiting for the first synthesis to finish.

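Here, synthesizing the second clip only after the first has finished is called blocking, and synthesizing the second clip without waiting for the first to finish is called non-blocking.

Longer text takes longer to synthesize than shorter text. Therefore, if the text for the first clip is longer than the text for the second, the first synthesis takes longer than the second.

In that case, if synthesis is blocking, the second synthesis starts only after the first has completed, so the first clip completes earlier than the second.

If synthesis is non-blocking, on the other hand, the completion order of the two clips can be reversed.

In other words, as long as the second request is guaranteed to be processed after the first, a reversed completion order shows that synthesis is non-blocking.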

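For each VOICEVOX Engine configuration, synthesis of a short text is therefore started 1 second after synthesis of a long text, and the completion times are compared to determine whether synthesis is blocking or non-blocking.

The 1-second interval between requests ensures that the second request is processed after the first.

In addition, to confirm that the client sends its requests asynchronously, a request for version information is sent a further 1 second later. The version request is expected to complete first.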

2. Method

The test environment was Ubuntu 22.04, with the following packages installed via apt.

sudo apt update;
sudo apt install -y erlang rebar3 docker.io;

For VOICEVOX Engine, the cpu-0.14.5 Docker image was pulled.

sudo docker pull voicevox/voicevox_engine:cpu-0.14.5;

To send requests to VOICEVOX Engine, erlvox was made available through rebar3.

rebar3 new lib bench;
cd bench;
echo '{erl_opts, [debug_info]}.' > rebar.config;
echo '{deps,[{erlvox,{git,"https://github.com/ts-klassen/erlvox",{tag,"0.1.2"}}}]}.' >> rebar.config;

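To record the time taken for synthesis to complete, an Erlang program containing the following main function was placed at bench/src/bench.erl. The full file, including main, is given in Appendix A.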

main(SynthesisType) ->
  {ok, ShortAudioQuery} = erlvox:audio_query({{127,0,0,1}, 50021}, 0, <<"abcdefg">>),
  {ok, LongAudioQuery} = erlvox:audio_query({{127,0,0,1}, 50021}, 0, <<"abcdefghijklmnopqrstuvwxyz">>),
  StartTime = now(),
  spawn(bench, bench, [SynthesisType, 1, self(), LongAudioQuery]),
  timer:sleep(1000),
  spawn(bench, bench, [SynthesisType, 2, self(), ShortAudioQuery]),
  timer:sleep(1000),
  spawn(bench, bench, [version, 3, self(), null]),
  io:format("~p, ~p~n", [0, 0]),
  print(StartTime, 1, 4).

For each of the conditions below, the Erlang shell was started with

rebar3 shell

and bench:main/1 was run with the arguments for that condition.

2.1 Plain synthesis

VOICEVOX Engine was started in Docker with the following options, and bench:main(synthesis) was run 3 times at 10-second intervals.

sudo docker run -itd -p 127.0.0.1:50021:50021 --restart=always --name voicevox voicevox/voicevox_engine:cpu-0.14.5 gosu user /opt/python/bin/python3 ./run.py --voicelib_dir /opt/voicevox_core/ --runtime_dir /opt/onnxruntime/lib --host 0.0.0.0;

2.2 cancellable with 1 process

VOICEVOX Engine was started in Docker with the following options, and bench:main(synthesis) and bench:main(cancellable_synthesis) were each run 3 times at 10-second intervals.

sudo docker stop voicevox;
sudo docker rm voicevox;
sudo docker run -itd -p 127.0.0.1:50021:50021 --restart=always --name voicevox voicevox/voicevox_engine:cpu-0.14.5 gosu user /opt/python/bin/python3 ./run.py --voicelib_dir /opt/voicevox_core/ --runtime_dir /opt/onnxruntime/lib --host 0.0.0.0 --enable_cancellable_synthesis --init_processes 1;

2.3 cancellable with 2 processes

VOICEVOX Engine was started in Docker with the following options, and bench:main(synthesis) and bench:main(cancellable_synthesis) were each run 3 times at 10-second intervals.

sudo docker stop voicevox;
sudo docker rm voicevox;
sudo docker run -itd -p 127.0.0.1:50021:50021 --restart=always --name voicevox voicevox/voicevox_engine:cpu-0.14.5 gosu user /opt/python/bin/python3 ./run.py --voicelib_dir /opt/voicevox_core/ --runtime_dir /opt/onnxruntime/lib --host 0.0.0.0 --enable_cancellable_synthesis --init_processes 2;

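3. Results

Each output line is a request number followed by the time from StartTime to that request's completion, in microseconds (1: long text, 2: short text, 3: version information). The line 0, 0 marks the beginning of each run's output.

3.1 Plain synthesis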

Erlang/OTP 24 [erts-12.2.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit]

Eshell V12.2.1  (abort with ^G)
1> bench:main(synthesis),
1> timer:sleep(10000),
1> bench:main(synthesis),
1> timer:sleep(10000),
1> bench:main(synthesis).
0, 0
1, 7996565
2, 15403448
3, 2007467
0, 0
1, 8036157
2, 14019518
3, 2006984
0, 0
1, 7881327
2, 12172089
3, 2007863
ok
2>

3.2 cancellable with 1 process

Erlang/OTP 24 [erts-12.2.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit]

Eshell V12.2.1  (abort with ^G)
1> bench:main(synthesis),
1> timer:sleep(10000),
1> bench:main(synthesis),
1> timer:sleep(10000),
1> bench:main(synthesis),
1> io:format("~n---~n~n", []),
1> timer:sleep(10000),
1> bench:main(cancellable_synthesis),
1> timer:sleep(10000),
1> bench:main(cancellable_synthesis),
1> timer:sleep(10000),
1> bench:main(cancellable_synthesis).
0, 0
1, 7830820
2, 10942437
3, 2006630
0, 0
1, 7896924
2, 11012296
3, 2006749
0, 0
1, 7802686
2, 10828065
3, 2007804

---

0, 0
1, 8296573
2, 11318060
3, 2006784
0, 0
1, 7929590
2, 10941597
3, 2006928
0, 0
1, 7889180
2, 10931320
3, 2006569
ok
2>

3.3 cancellable with 2 processes

Erlang/OTP 24 [erts-12.2.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit]

Eshell V12.2.1  (abort with ^G)
1> bench:main(synthesis),
1> timer:sleep(10000),
1> bench:main(synthesis),
1> timer:sleep(10000),
1> bench:main(synthesis),
1> io:format("~n---~n~n", []),
1> timer:sleep(10000),
1> bench:main(cancellable_synthesis),
1> timer:sleep(10000),
1> bench:main(cancellable_synthesis),
1> timer:sleep(10000),
1> bench:main(cancellable_synthesis).
0, 0
1, 7811706
2, 10813868
3, 2007533
0, 0
1, 7983380
2, 10978020
3, 2006602
0, 0
1, 7804402
2, 10815812
3, 2006995

---

0, 0
1, 11510825
2, 7523660
3, 2007106
0, 0
1, 11272087
2, 7181113
3, 2006159
0, 0
1, 11061440
2, 7098972
3, 2005636
ok
2>

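4. Discussion

Although the version request is sent last, it completes first in every run, which confirms that the client program works asynchronously as intended.

Both the first and the second clip take more than 2 seconds to synthesize. Synthesis is therefore judged blocking if the second clip completes after the first, and non-blocking if the first clip completes after the second.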
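
The decision rule above can be written as a small Erlang function. This is only an illustrative sketch: the module name verdict and the function classify/2 are hypothetical and not part of the benchmark. The two arguments are assumed to be the completion times of request 1 (long text) and request 2 (short text) as printed by bench:main/1.

```erlang
-module(verdict).
-export([classify/2]).

%% classify(End1, End2) -> blocking | non_blocking
%% End1: completion time of request 1 (long text)
%% End2: completion time of request 2 (short text)
%% Both must be measured from the same start time, in the same unit.
%% Assumes request 2 was sent after request 1 (hypothetical helper).
classify(End1, End2) when End2 < End1 ->
    non_blocking;   % completion order is reversed
classify(_End1, _End2) ->
    blocking.       % request 1 completed first
```

Applied to measured values from the results, verdict:classify(7996565, 15403448) (first run of 3.1) returns blocking, and verdict:classify(11510825, 7523660) (first cancellable_synthesis run of 3.3) returns non_blocking.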

4.1 Plain synthesis

In every run the second clip completes after the first, so synthesis is judged to be blocking.

4.2 cancellable with 1 process

In every run the second clip completes after the first, so both synthesis and cancellable_synthesis are judged to be blocking.

4.3 cancellable with 2 processes

With synthesis, the second clip completes after the first in every run, whereas with cancellable_synthesis the first clip completes after the second in every run. That is, synthesis is judged to be blocking and cancellable_synthesis non-blocking.

The results show that synthesis is blocking regardless of the enable_cancellable_synthesis and init_processes settings. With init_processes set to 2 and enable_cancellable_synthesis set, cancellable_synthesis becomes non-blocking.

When init_processes is 1, on the other hand, only one synthesis can run at a time, which presumably explains the blocking behavior.
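
The explanation above can be illustrated with a toy queue model in which each job is handed, in arrival order, to whichever worker becomes free first. This is a sketch under stated assumptions, not VOICEVOX Engine's actual implementation: the module queue_model, the function completions/2, and the job durations used below are all hypothetical.

```erlang
-module(queue_model).
-export([completions/2]).

%% completions(Workers, Jobs) -> [{Id, FinishTime}]
%% Workers: number of jobs that can run at the same time.
%% Jobs: list of {Id, Arrival, Duration}, sorted by Arrival.
%% Toy model only; it ignores any slowdown from running jobs in parallel.
completions(Workers, Jobs) when is_integer(Workers), Workers > 0 ->
    Free = lists:duplicate(Workers, 0),   % time at which each worker is free
    assign(Jobs, Free, []).

assign([], _Free, Acc) ->
    lists:reverse(Acc);
assign([{Id, Arrival, Duration} | Rest], Free, Acc) ->
    %% Pick the worker that becomes free earliest.
    [Earliest | Others] = lists:sort(Free),
    %% The job starts when it has arrived and the worker is free.
    Finish = max(Arrival, Earliest) + Duration,
    assign(Rest, [Finish | Others], [{Id, Finish} | Acc]).
```

With the illustrative jobs [{1, 0, 8}, {2, 1, 3}] (a long job arriving at time 0 and a short job arriving at time 1), queue_model:completions(1, Jobs) returns [{1,8},{2,11}] and queue_model:completions(2, Jobs) returns [{1,8},{2,4}]. In this model the completion order reverses only when there are two workers.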

5. Conclusion

By setting init_processes to 2 or more, enabling enable_cancellable_synthesis, and sending synthesis requests to cancellable_synthesis instead of synthesis, multiple clips can be synthesized at the same time.
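
One way a client might take advantage of this is to issue the requests from separate Erlang processes and collect the results in input order. The following is a minimal sketch; the module par and the function map/2 are hypothetical helpers, not part of erlvox.

```erlang
-module(par).
-export([map/2]).

%% map(Fun, List) -> [Result]
%% Runs Fun on every element of List in its own process and returns
%% the results in the order of List (hypothetical helper).
%% Workers are linked, so a crash in Fun also terminates the caller
%% instead of leaving it waiting forever.
map(Fun, List) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(X)} end),
                Ref
            end || X <- List],
    %% Selective receive on each Ref keeps the results in input order,
    %% whatever order the workers finish in.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

Used with the erlvox calls that appear in this report, and assuming the second element of the {ok, _} tuple returned by erlvox:cancellable_synthesis/3 is the synthesized audio, a call could look like this in the Erlang shell:

```erlang
Engine = {{127,0,0,1}, 50021},
Texts = [<<"abcdefghijklmnopqrstuvwxyz">>, <<"abcdefg">>],
Audio = par:map(fun(Text) ->
                    {ok, Query} = erlvox:audio_query(Engine, 0, Text),
                    {ok, Wav} = erlvox:cancellable_synthesis(Engine, 0, Query),
                    Wav
                end, Texts).
```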

Appendix A: module bench

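-module(bench).

-export([main/1, bench/4]).

%% Run one measurement. SynthesisType is synthesis or cancellable_synthesis.
%% Prints "No, Microseconds" for each request, measured from StartTime.
%% Note: now/0 is deprecated since OTP 18; erlang:timestamp/0 returns the
%% same tuple format and also works with timer:now_diff/2.
main(SynthesisType) ->
  {ok, ShortAudioQuery} = erlvox:audio_query({{127,0,0,1}, 50021}, 0, <<"abcdefg">>),
  {ok, LongAudioQuery} = erlvox:audio_query({{127,0,0,1}, 50021}, 0, <<"abcdefghijklmnopqrstuvwxyz">>),
  StartTime = now(),
  %% Request 1: long text, sent immediately.
  spawn(bench, bench, [SynthesisType, 1, self(), LongAudioQuery]),
  timer:sleep(1000),
  %% Request 2: short text, sent 1 second later.
  spawn(bench, bench, [SynthesisType, 2, self(), ShortAudioQuery]),
  timer:sleep(1000),
  %% Request 3: version information, sent a further 1 second later.
  spawn(bench, bench, [version, 3, self(), null]),
  io:format("~p, ~p~n", [0, 0]),
  print(StartTime, 1, 4).

%% Print the elapsed time of requests No..UpToNo-1 in request-number order.
print(_, No, No) -> ok;
print(StartTime, No, UpToNo) ->
  receive
    {endtime, No, EndTime} ->
      TimeDiff = timer:now_diff(EndTime, StartTime),
      io:format("~p, ~p~n", [No, TimeDiff]),
      print(StartTime, No+1, UpToNo)
  end.

%% Worker: send one request, then report its completion time to Pid.
bench(synthesis, No, Pid, AudioQuery) ->
  {ok, _} = erlvox:synthesis({{127,0,0,1}, 50021}, 0, AudioQuery),
  Pid ! {endtime, No, now()};

bench(cancellable_synthesis, No, Pid, AudioQuery) ->
  {ok, _} = erlvox:cancellable_synthesis({{127,0,0,1}, 50021}, 0, AudioQuery),
  Pid ! {endtime, No, now()};

bench(version, No, Pid, _) ->
  {ok, _} = erlvox:version({{127,0,0,1}, 50021}),
  Pid ! {endtime, No, now()}.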
