In reality, an editor could use speech-to-text to transcribe a spoken comment into an ordinary text comment, with a tag to indicate that it was originally spoken. Then, when an editor encounters such a comment, it would read it aloud using text-to-speech. For example:
// transcript: Holy fuck what is wrong with this stupid code‽ for fucks sake! *inaudible* I've spent hours on this. I'm going to... nevermind it was a semicolon. Undo comment. Remove comment. Cancel comment.
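
A rough sketch of the round trip, in Python, might look like the following. It assumes the `transcript:` tag and the `//` comment style from the example above; the function names are made up for illustration, and the actual speech-to-text and text-to-speech engines are left out, since no particular one is implied here.

```python
import re

# Assumed tag, taken from the example comment above.
TRANSCRIPT_TAG = "transcript:"

# Matches a single-line "//" comment that starts with the tag.
_SPOKEN = re.compile(r"^\s*//\s*" + re.escape(TRANSCRIPT_TAG) + r"\s*(.*)$")


def to_spoken_comment(transcribed_text: str) -> str:
    """Wrap speech-to-text output in a tagged single-line comment."""
    # Collapse whitespace and newlines so the transcript stays on one line.
    one_line = " ".join(transcribed_text.split())
    return f"// {TRANSCRIPT_TAG} {one_line}"


def extract_spoken_text(line: str):
    """Return the transcript if the line is a tagged spoken comment, else None."""
    match = _SPOKEN.match(line)
    return match.group(1) if match else None


def spoken_comments(source: str):
    """Collect the text an editor would hand to its text-to-speech engine."""
    found = []
    for line in source.splitlines():
        text = extract_spoken_text(line)
        if text is not None:
            found.append(text)
    return found
```

An editor would call something like `to_spoken_comment` on whatever the microphone produced, and later pass each string from `spoken_comments` to its speech engine. Untagged comments are ignored, so ordinary comments stay silent.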