Hi Shadyar,
Not an immediate solution at all, but I would say that an AI (machine learning) system that snapshots the screen or window, extracts the text from the snapshot image, and then reads it aloud might be superior to legacy accessibility API paradigms, which rely on application developers to interleave accessibility information (ARIA etc.) into each and every field.
Or at least it could serve as an augmentation, providing a really solid fallback for any ARIA-like paradigm.
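Just to make the idea concrete, here is a rough Python sketch of the snapshot → OCR → speech pipeline. The choice of Pillow, pytesseract (a wrapper around the Tesseract OCR engine), and pyttsx3 is mine, just one possible stack among many, and `read_screen_aloud` is a hypothetical helper name, not an existing API:

```python
def read_screen_aloud() -> str:
    """Snapshot the screen, OCR the visible text, and speak it aloud."""
    # Local imports so the module loads even where these optional
    # dependencies are not installed.
    from PIL import ImageGrab      # screen capture (Pillow)
    import pytesseract             # wrapper around the Tesseract OCR engine
    import pyttsx3                 # offline text-to-speech

    image = ImageGrab.grab()                   # snapshot the full screen
    text = pytesseract.image_to_string(image)  # extract visible text
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()                        # block until speech finishes
    return text

if __name__ == "__main__":
    read_screen_aloud()
```

A real tool would of course need more than this (per-window capture, layout-aware reading order, change detection so it only re-reads what changed), but the core loop really is this small.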
It would be a project, sure, but it is very accomplishable in this day and age.
Hopefully one day our desktops will be more fluid than merely providing voice services on top of a graphical interaction interface, but a lot can be done until then by leveraging computer vision AI in this space. Sorry again that this is not an immediate solution.
Matan