I've read a lot of reports online of these sensors being unreliable or difficult to tune. The latter is partly true, but there's a lot of misunderstanding about how to interpret the data. Once you understand how to interpret it, it becomes much clearer how to calibrate the sensor and how to configure your ESPHome device to use the data it provides.
My first thought was that higher measured voltage = louder noise. After all, every sensor I'd worked with behaved this way, although most of them reported a 'human friendly' value directly - e.g. temperature, CO2 or humidity.
Sound is different. You probably know that sound is made up of waves, and you've probably seen a waveform corresponding to audio.
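If you haven't looked at a waveform as raw numbers before, here's a rough Python sketch of the idea. It isn't real sensor data: I'm assuming a pure sine tone centred on a baseline of 0, and the helper names (`sine_wave`, `mean`, `peak_to_peak`) are just made up for the illustration. It builds a quiet and a loud version of the same tone and compares the average value with the spread between the lowest and highest sample.

```python
import math


def sine_wave(amplitude, samples=1000, cycles=5):
    """Return samples of a sine wave centred on a baseline of 0.

    Assumption: a pure tone stands in for a real sound here.
    """
    return [
        amplitude * math.sin(2 * math.pi * cycles * i / samples)
        for i in range(samples)
    ]


def mean(values):
    """Average of all samples."""
    return sum(values) / len(values)


def peak_to_peak(values):
    """Distance between the lowest and the highest sample."""
    return max(values) - min(values)


quiet = sine_wave(amplitude=0.1)  # hypothetical quiet tone
loud = sine_wave(amplitude=1.0)   # same tone, ten times the amplitude

for name, wave in (("quiet", quiet), ("loud", loud)):
    print(f"{name}: mean={mean(wave):.3f}, peak-to-peak={peak_to_peak(wave):.3f}")
```

Both tones average out to roughly the same value, because the positive and negative halves cancel. The only thing that really separates them is how far the samples swing away from the baseline.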
Louder sounds aren't represented by just a higher peak, but by a higher deviation. For example, your baseline is set at 0. When a sound happens, the waveform starts and you have values that are positive and negative - so shouting might be represented as many po