To run the model, you need a top-end video card.
Chinese researchers have published a neural network that generates reasonably acceptable video from a text description. Experiments in this area have been going on for a long time, but until now researchers could not avoid inconsistencies between the AI-generated frames.
The model has more than 1.5 billion parameters, so it requires a fairly large amount of memory to generate video. A graphics card with at least 16 GB of memory is recommended, although some enthusiasts have managed to get by with 12 GB.
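For readers who want to try something similar, below is a minimal sketch of how a text-to-video diffusion pipeline is typically run with the Hugging Face diffusers library while staying within a 12-16 GB memory budget. The article does not name the checkpoint, so the model ID in the sketch is a placeholder; half-precision weights, CPU offloading, and VAE slicing are standard memory-saving options, not settings confirmed by the source.

```python
# A hedged sketch of running a text-to-video diffusion model on a
# memory-constrained GPU. The model ID is a PLACEHOLDER, not the real
# checkpoint discussed in the article.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "some-org/text-to-video-model",  # placeholder model ID
    torch_dtype=torch.float16,       # half precision roughly halves VRAM use
)
pipe.enable_model_cpu_offload()      # keep idle submodules in system RAM
pipe.enable_vae_slicing()            # decode frames in slices to save memory

# Generate a short clip from a text prompt and save it as a video file.
result = pipe("a panda riding a bicycle through a city", num_frames=16)
export_to_video(result.frames[0], "output.mp4")
```

With these options the full pipeline never sits on the GPU at once, which is how enthusiasts typically squeeze a model of this size into 12 GB at the cost of slower generation.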
The Chinese model handles consistency much better. In most demos, neither the backgrounds nor the characters actively moving in the foreground show noticeable glitches. That said, the characters themselves can look strange.
However, as enthusiasts note, the peculiar appearance of the characters in the clips is not a big problem. The first images created by neural networks like DALL-E or Midjourney did not look great either, yet generation quality has since improved dramatically.
What matters more is that the creators of the Chinese model managed to keep the appearance, shape, and size of the characters consistent throughout the clips. Future versions of the network will likely improve overall generation quality as well.
The Shutterstock watermark that appears in many of the videos can be ignored: the model was trained on stock footage carrying that watermark, so it learned to reproduce it.
You can download the model here, or experiment with it online here.