Maybe Whisper? This GitHub repo: https://github.com/linto-ai/whisper-timestamped
says that Whisper can produce timestamps for speech segments. However, I don't know if that's what you want, since Whisper might only be able to do that when it is transcribing the actual audio, rather than editing an existing text file.
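In case it helps, here is a rough sketch of what that could look like in Python. The `load_audio` / `load_model` / `transcribe` calls follow the usage shown in that repo's README as far as I remember it, so check them against the current version. The helper names (`format_timestamp`, `segment_lines`, `transcribe_with_timestamps`) are my own, and I'm assuming the result is a dict with a `"segments"` list whose entries have `"start"`, `"end"` (in seconds) and `"text"`.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segment_lines(result: dict) -> list[str]:
    """Turn a transcription result into '[start -> end] text' lines.

    Assumes result["segments"] is a list of dicts with "start", "end"
    and "text" keys (an assumption about the output shape).
    """
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in result.get("segments", [])
    ]


def transcribe_with_timestamps(audio_path: str, model_name: str = "tiny") -> dict:
    """Transcribe an audio file; needs whisper-timestamped installed."""
    # Imported here so the formatting helpers work without the package.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model(model_name, device="cpu")
    return whisper.transcribe(model, audio)
```

You would call something like `print("\n".join(segment_lines(transcribe_with_timestamps("talk.wav"))))`, where `talk.wav` is just a placeholder file name. Note that this still transcribes the audio itself; it does not line up timestamps against a text file you already have.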
No need for AI for that, humans can do it better:
https://youtube.com/watch?v=l7ZUZerGwK4
https://youtube.com/watch?v=zn_rx8Zyl54
If you know where to look, someone already did it.