The pricing positions GLM-Image as an economical option for enterprises producing marketing materials, presentations, and other text-heavy visual content at scale.
Technical approach and benchmark performance
GLM-Image employs a hybrid architecture combining a 9-billion-parameter autoregressive model with a 7-billion-parameter diffusion decoder, according to Zhipu’s technical report. The autoregressive component handles instruction understanding and overall image composition, while the diffusion decoder focuses on rendering fine details and accurate text.
The architecture addresses challenges in generating knowledge-intensive visual content where both semantic understanding and precise text rendering matter, such as presentation slides, infographics, and commercial posters.
On the CVTG-2K benchmark, which measures accuracy in placing text across multiple image locations, GLM-Image achieved a Word Accuracy score of 0.9116, ranking first among open-source models. The model also led the LongText-Bench test for rendering extended text passages, scoring 0.952 for English and 0.979 for Chinese across eight scenarios including signs, posters, and dialog boxes.
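To give a rough sense of what a word-level accuracy score captures, here is a minimal sketch in Python. It assumes the metric is the share of intended words that an OCR pass recovers from the generated image; the function name `word_accuracy` and the matching rule are illustrative assumptions, not the benchmark's published scoring code.

```python
from collections import Counter


def word_accuracy(expected: list[str], recognized: list[str]) -> float:
    """Share of expected words found in the OCR output (hypothetical metric).

    Matching is case-insensitive and respects word counts, so a word that
    should appear twice only scores twice if it was recognized twice.
    """
    if not expected:
        return 0.0
    want = Counter(w.lower() for w in expected)
    got = Counter(w.lower() for w in recognized)
    # Counter & Counter keeps the minimum count for each shared word.
    matched = sum((want & got).values())
    return matched / len(expected)


# Illustrative data: one of four words ("SALE") was rendered incorrectly.
score = word_accuracy(
    ["GRAND", "OPENING", "SALE", "TODAY"],
    ["grand", "opening", "sole", "today"],
)
print(score)
```

Under this simplified definition, a score of 0.9116 would mean roughly nine in ten intended words come out legible and correctly spelled.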
The model natively supports multiple resolutions from 1024×1024 to 2048×2048 pixels without requiring retraining, the report added.
Hardware optimization strategy
Training GLM-Image on Ascend hardware required Zhipu to develop custom optimization techniques for Huawei’s chip architecture. The company built a training suite that implements dynamic graph multi-level pipelined deployment, enabling different stages of the training process to run concurrently and reducing bottlenecks.
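The general idea of pipelining, in which each stage begins work on the next item while later stages are still busy with earlier ones, can be sketched in a few lines of Python. This is a generic illustration using threads and queues, not Zhipu's training suite or any Ascend API; the names `stage` and `run_pipeline` are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the input stream


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # tell the next stage to stop as well
            return
        outbox.put(fn(item))


def run_pipeline(items, fns):
    """Run each function in fns as its own concurrently running stage."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Two toy stages standing in for steps of a longer process.
print(run_pipeline(range(4), [lambda x: x + 1, lambda x: x * 2]))
```

Because every stage runs in its own thread, the first stage can start on item two while the second stage is still handling item one, which is the kind of overlap that reduces idle time between steps.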
