The hottest topic in large models has swung back to video models, as a startup is believed to have built the "strongest domestic Sora."
On April 27th, at the Zhongguancun Forum's artificial intelligence frontier forum, Shengshu Technology, together with Tsinghua University, unveiled the video model Vidu, billed as offering "long duration, high consistency, and high dynamism." It can directly generate high-definition videos of up to 16 seconds at 1080P resolution from text descriptions.
Long duration and high consistency are the characteristics the team emphasized most. Zhu Jun, Vice Dean of the Institute for Artificial Intelligence at Tsinghua University and Chief Scientist of Shengshu Technology, said that domestic video large models currently generate clips of mostly around 4 seconds, whereas Vidu can generate a 16-second video in a single pass. At the same time, the footage remains coherent and smooth, and as the camera moves, characters and scenes maintain high consistency across time and space.
In terms of dynamism, Vidu's camera work is not limited to simple pushes, pulls, pans, and tracking moves; it can switch among long, medium, close, and close-up shots within a scene, and directly generate effects such as transitions, focus shifts, and long takes. On the physics side, Zhu Jun said Vidu can simulate the real physical world, producing scenes whose complex details obey physical laws, such as plausible light and shadow and subtle facial expressions. It can also generate surreal content with depth and complexity (such as a cat wearing pearl earrings).
Many of the demo videos released by Shengshu Technology do indeed show consistency across time and space, which is a key problem video models must solve for long-duration generation.
Consistency cannot be discussed in isolation from video duration. At present, the longest clip Vidu has shown publicly is 16 seconds, while Sora's maximum is 1 minute. After Sora launched in February this year, Shengshu Technology set up an internal team to accelerate R&D on its existing video direction: in March the team achieved 8-second video generation and raised that to 16 seconds in April, but it has not disclosed further details about the technical breakthroughs.
A technical expert working on multimodal large models told Jiemian News that duration is not the most critical metric, because as long as the camera moves slowly enough within a single scene, the duration can be controlled. Sora's initial wow factor came mainly from its ability to show camera movement and multi-scene stitching, and from how its videos obeyed objective physical laws even with large movements, multiple perspectives, and high dynamism.
But this is not evident in Vidu's videos, where "each shot is fairly short and does not involve complex semantic switching of elements." He said that compared with current open-source approaches, Vidu improves spatiotemporal resolution overall, but shows no qualitative leap.
From the information available, Vidu uses a self-developed U-ViT architecture, which, like Sora, fuses Diffusion and Transformer. Rather than producing long videos through multi-step processing such as frame interpolation, this architecture generates the content "end to end" in a single pass, so the transformation from text to video is direct and continuous.
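For readers unfamiliar with this design, the sketch below shows the general shape of a U-ViT-style diffusion backbone as described in public research: noised video-latent patches, the diffusion timestep, and the text prompt are all treated as tokens of a single Transformer sequence, with U-Net-style long skip connections between shallow and deep blocks. It is a minimal PyTorch illustration; the class names, dimensions, and patching scheme are assumptions made for clarity, not Vidu's actual implementation.

```python
# Minimal, illustrative sketch of a U-ViT-style diffusion backbone.
# All sizes and names are hypothetical; this is not Vidu's real code.
import torch
import torch.nn as nn

class UViTBlock(nn.Module):
    """Pre-norm Transformer block, optionally fused with a long skip connection."""
    def __init__(self, dim: int, heads: int = 8, skip: bool = False):
        super().__init__()
        self.skip_proj = nn.Linear(2 * dim, dim) if skip else None
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, skip=None):
        if self.skip_proj is not None:
            x = self.skip_proj(torch.cat([x, skip], dim=-1))  # U-Net-style long skip
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

class TinyUViT(nn.Module):
    """Treats noised latent patches, the diffusion timestep, and text tokens
    as one token sequence, and predicts the noise for every patch token."""
    def __init__(self, dim: int = 256, depth: int = 6, patch_dim: int = 64, text_dim: int = 512):
        super().__init__()
        assert depth % 2 == 0
        self.patch_in = nn.Linear(patch_dim, dim)
        self.time_in = nn.Linear(1, dim)
        self.text_in = nn.Linear(text_dim, dim)
        half = depth // 2
        self.encoder = nn.ModuleList([UViTBlock(dim) for _ in range(half)])
        self.decoder = nn.ModuleList([UViTBlock(dim, skip=True) for _ in range(half)])
        self.patch_out = nn.Linear(dim, patch_dim)

    def forward(self, patches, t, text):
        # patches: (B, N, patch_dim) noised video-latent patches across all frames
        # t:       (B, 1) diffusion timestep, text: (B, M, text_dim) prompt tokens
        n = patches.shape[1]
        x = torch.cat([self.time_in(t).unsqueeze(1), self.text_in(text), self.patch_in(patches)], dim=1)
        skips = []
        for blk in self.encoder:
            x = blk(x)
            skips.append(x)
        for blk in self.decoder:
            x = blk(x, skip=skips.pop())
        return self.patch_out(x[:, -n:])  # predicted noise for the patch tokens only
```

Because every frame's patches sit in the same sequence and are denoised together, the whole clip comes out of one generation pass instead of being stitched together from separately generated and interpolated segments.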
This also means that Vidu must keep stacking larger parameter counts and more computing power under the Scaling Law, a path the model cannot sidestep.
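A rough back-of-the-envelope calculation shows why: with full self-attention over every video token, the token count grows linearly with clip length, so attention cost grows roughly quadratically. The frame rate and patch count below are illustrative assumptions, not Vidu's actual configuration.

```python
# Back-of-envelope sketch with made-up numbers: longer clips mean more tokens,
# and full self-attention cost scales roughly with the square of the token count.
def video_tokens(seconds: float, fps: int = 8, patches_per_frame: int = 1024) -> int:
    """Token count if every frame is split into a fixed grid of latent patches."""
    return int(seconds * fps) * patches_per_frame

def rel_attention_cost(seconds: float, base_seconds: float = 4.0) -> float:
    """Attention cost relative to a short baseline clip (~N^2 scaling)."""
    n, n0 = video_tokens(seconds), video_tokens(base_seconds)
    return (n / n0) ** 2

for s in (4, 8, 16, 60):
    print(f"{s:>3}s clip -> ~{rel_attention_cost(s):,.0f}x the attention cost of a 4s clip")
# Under this naive assumption a 16s clip costs ~16x and a 60s clip ~225x a 4s clip,
# one reason one-shot long-video generation leans so heavily on scale.
```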
Beyond the constraint of computing power, a practitioner with experience building multimodal large models told Jiemian News that the data gap is an important difference between domestic video large models and Sora. A video large model needs to be fed a huge amount of data, and the path is a gradual process of polishing and validation. Delivering on it is a certainty, but it will take time.
Therefore, although Shengshu Technology's progress within two months already demonstrates its algorithmic and engineering strength, there is still a long way to go before it can truly benchmark against Sora, matching its performance and reaching its 1-minute duration; at the very least, it will not be a simple linear extrapolation like "double the length in another two months."
"The gap between 16 seconds and 1 minute looks like roughly four times, but the error that accumulates in between may take far more than four times the computing power or engineering work to close," a large-model investor told Jiemian News.
He also pointed out that, like Sora, Vidu has not yet released enough material; from what has been made public, the consistency is indeed good, but it is still hard to make a more precise judgment.
From this perspective, a direct comparison between Vidu and Sora may rest more on the dynamism represented by camera movement and on their understanding and simulation of physical laws; the headline capabilities of long duration and consistency will still need to be compared in future version iterations.