(Speech/sound recognition) Why do most articles/books show code that trains a machine on JPEG files of plotted spectrograms?

Background: Hi, I am a total beginner in machine learning, with a civil engineering background.
I am attempting to train a machine (using Python) to build an anomaly-detection algorithm that can detect defects inside concrete using the sound from a hammering test.

From what I understand, to train a machine for speech recognition, you need to process your sound signal with signal-processing techniques like the Short-Time Fourier Transform (STFT) or Wavelet Analysis. From this analysis, you get your sound data decomposed into the frequencies (over time) that it is made up of. So the data has three dimensions: time, frequency, and amplitude (in practice, a 2D array of amplitudes indexed by time and frequency).
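For concreteness, here is a minimal sketch of how I understand this step, using SciPy's STFT on a synthetic signal (the sample rate and window length are just assumptions I picked, not values from any article):

```python
# Sketch: decompose a signal into a time-frequency amplitude array via STFT.
import numpy as np
from scipy import signal

fs = 44100                        # sample rate in Hz (assumed)
t = np.linspace(0, 1.0, fs, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)   # stand-in for a recorded hammering sound

# STFT with a 1024-sample window -> 513 frequency bins per time frame
f, tt, Zxx = signal.stft(x, fs=fs, nperseg=1024)
amplitude = np.abs(Zxx)           # 2D array: frequency bins x time frames

print(amplitude.shape)            # (frequency bins, time frames)
```

So the "spectrogram" at this point is already just a numeric 2D array, before any plotting happens.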

After that, most articles I have read plot a spectrogram from this array and save it as a JPG/JPEG. The image file is then processed again and fed into a neural network. The rest is the same as training an image-recognition model.
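The pattern I keep seeing looks roughly like this sketch (file names and figure settings are my own; I save as PNG here, but the articles typically use JPEG, which works the same way):

```python
# Sketch: render the amplitude array as an image file, as the articles do.
import os
import tempfile

import numpy as np
import matplotlib
matplotlib.use("Agg")             # render off-screen, no display needed
import matplotlib.pyplot as plt

# Stand-in for the |STFT| amplitude array from the previous step
amplitude = np.abs(np.random.randn(513, 100))

fig, ax = plt.subplots()
ax.pcolormesh(amplitude)          # draw the spectrogram as colored pixels
ax.axis("off")                    # articles usually drop axes/labels

out_path = os.path.join(tempfile.gettempdir(), "spectrogram.png")
fig.savefig(out_path)             # this image file is what gets fed to the CNN
plt.close(fig)
```

So the numeric array is converted to colored pixels, written to disk, and then read back as an image.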

My question is: why do we need to plot the array as a spectrogram (image file) and feed our machine the image file, instead of using the array directly?
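What I imagine instead is something like this sketch (the scaling choice and shapes here are my own assumptions): feed the raw amplitude array, just rescaled, as a single-channel input, with no image file in between.

```python
# Sketch: prepare the raw STFT array as network input without plotting it.
import numpy as np

# Stand-in for the |STFT| amplitude array (frequency bins x time frames)
amplitude = np.abs(np.random.randn(513, 100))

# Min-max scale to [0, 1], like pixel normalization for images
scaled = (amplitude - amplitude.min()) / (amplitude.max() - amplitude.min())

# Add a channel axis -> shape (1, 513, 100), a single-channel "image" tensor
x_input = scaled[np.newaxis, :, :]
print(x_input.shape)
```

This seems to preserve the actual amplitude values, whereas the JPEG route adds a colormap, compression, and resizing on top of them, which is why the detour puzzles me.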

