mkv or mp4 shouldn't matter because they get decoded before getting filtered. Try removing just that option ( :force_divisible_by=2 ) and see if it works.
That select filter is a lot to digest, so here I'll highlight the key tunable numbers. Adjust them to find what works best for your content.
ffmpeg -i "video" -vsync vfr -vf "select=if(gt(scene\,0.5)*(isnan(prev_selected_t)+gte(t-prev_selected_t\,2))\,st(1\,t)*0*st(2\,ld(2)+1)\,if(ld(1)*lt(ld(2)\,4)\,between(t\,ld(1)+2\,ld(1)+4))),scale=320:180:force_original_aspect_ratio=decrease:flags=bicubic+full_chroma_inp:sws_dither=none,framestep=2,setpts=N/(12*TB)" -an -sn -map_metadata -1 -compression_level 5 -q:v 75 -loop 0 -f webp -y "out.webp"
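1st "2" (in gte(t-prev_selected_t\,2)) is the minimum number of seconds between captured scene cuts.
2nd "4" (in lt(ld(2)\,4)) is the number of scene cuts to capture plus one (n+1).
3rd "2" (in ld(1)+2) is the number of seconds after the scene cut to start capture.
4th "4" (in ld(1)+4) is the number of seconds after the scene cut to stop capture.
5th "12" (in setpts=N/(12*TB)) is half of the input video's framerate, because framestep=2 keeps only every second frame.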
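To make those five numbers easier to tune, here is a minimal sketch that builds the same -vf string from named arguments. The function name build_filter and the variable names are hypothetical, just labels for this example; the filter text itself is copied from the command above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: assemble the -vf string from the five tunables.
#   $1 min_gap    minimum seconds between captured scene cuts
#   $2 cut_limit  the "n+1" scene-cut limit
#   $3 start_off  seconds after the scene cut to start capture
#   $4 stop_off   seconds after the scene cut to stop capture
#   $5 out_fps    half of the input video's framerate
build_filter() {
  local min_gap=$1 cut_limit=$2 start_off=$3 stop_off=$4 out_fps=$5
  # Inside double quotes, "\," stays a literal backslash-comma, which is
  # what ffmpeg needs to escape commas inside the select expression.
  local sel="select=if(gt(scene\,0.5)*(isnan(prev_selected_t)+gte(t-prev_selected_t\,${min_gap}))"
  sel="${sel}\,st(1\,t)*0*st(2\,ld(2)+1)"
  sel="${sel}\,if(ld(1)*lt(ld(2)\,${cut_limit})\,between(t\,ld(1)+${start_off}\,ld(1)+${stop_off})))"
  local scale="scale=320:180:force_original_aspect_ratio=decrease:flags=bicubic+full_chroma_inp:sws_dither=none"
  printf '%s,%s,framestep=2,setpts=N/(%s*TB)\n' "$sel" "$scale" "$out_fps"
}

# Prints the filter with the same values used in the command above.
build_filter 2 4 2 4 12
```

You could then pass it along as -vf "$(build_filter 2 4 2 4 12)" in place of the long quoted string.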
You can capture the duration and framerate with the following, which works in bash on Linux. I know macOS has bash installed (although it's no longer the default shell), but I'm not sure whether this will work there, as I'm not very familiar with Macs.
metaArray=($(ffprobe -v 0 -select_streams V:0 -show_entries stream=r_frame_rate:format=duration -of default=nw=1:nk=1 "$1"))
halfFramerate=$(bc <<< "scale=3;${metaArray[0]}/2")
minSceneDistance=$(bc <<< "scale=3;${metaArray[1]}/12")
Why divide into 12 segments to get 6 scenes? Because after that 1/12 time spacer, you have to keep seeking until the next scene cut, so the real spacing between captures ends up being larger. This is where content type really matters, because things like sitcoms have far fewer scene cuts than, say, animation. For shows like that, you may end up with only four segments instead of six and need to decrease the time interval (increase the divisor).
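As a rough illustration of how the divisor changes the spacing, here is a small sketch. The helper names half_rate and scene_spacing are made up for this example, and it uses awk instead of bc only so it runs without bc installed. The sample values (a 24000/1001 framerate and a 1440-second duration) are assumptions, not taken from any particular file.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original script.

# half_rate RATE: RATE is a num/den fraction as reported by ffprobe's
# r_frame_rate (e.g. 24000/1001); prints half of it to 3 decimals.
half_rate() {
  awk -v r="$1" 'BEGIN {
    n = split(r, f, "/")
    fps = (n == 2 && f[2] != 0) ? f[1] / f[2] : f[1]
    printf "%.3f\n", fps / 2
  }'
}

# scene_spacing DURATION DIVISOR: minimum seconds between captured cuts.
scene_spacing() {
  awk -v d="$1" -v k="$2" 'BEGIN { printf "%.3f\n", d / k }'
}

half_rate "24000/1001"
scene_spacing 1440 12   # the 1/12 spacer described above
scene_spacing 1440 16   # larger divisor, shorter spacer
```

For a 1440-second input, raising the divisor from 12 to 16 shortens the spacer from 120 to 90 seconds, which leaves more room to find cuts in content that has few of them.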
There also seems to be an alternative here: https://gist.github.com/Voldrix/84a01b602e5d6c53c2b67e156bf26a10