

It’s quite noteworthy how often these shots start out somewhat okay at the first prompt, but then deteriorate markedly over the following seconds.
As a layperson, I would try to explain this as follows: at the beginning, the AI is - to some extent - free to “pick” what the characters and their surroundings look like (while staying within the constraints of the prompt, of course, even if this doesn’t always work out either).
Therefore, the AI can basically “fill in the blanks” from its training data and create something that may look somewhat impressive at first glance.
However, when continuing the shot, the AI is now stuck with these characters and surroundings while having to follow a plot that may not be well represented in its training data, especially not for the specific characters and surroundings it picked. This is why we frequently see inconsistencies, deviations from the prompt, or just plain nonsense.
If this assumption is right, it might be very difficult to improve these video generators, because an unrealistic amount of additional training data would be required.
Edit: According to other people, it may also be related to memory/hardware etc. In that case, my guesses above may not apply. Or maybe it is a mixture of both.
This is a very important point, I believe. I find it particularly ironic that the “traditional” Internet was fairly efficient precisely because many people were shown more or less the same content, which also made a certain degree of quality assurance easier. With chatbots, all of this is being thrown overboard and extreme inefficiencies are being created, and the AI hypemongers seem to be largely ignoring that.