Moving from "data fusion" to "native architecture": SenseTime releases NEO architecture, redefining the performance boundary of multimodal models


SenseTime has officially released and open-sourced NEO, a new multimodal model architecture developed in collaboration with S-Lab at Nanyang Technological University. NEO lays the architectural foundation for the next generation of SenseNova multimodal models.

SenseTime describes NEO as the industry's first usable native multimodal architecture (Native VLM) to achieve deep fusion. Rather than extending the traditional "modular" paradigm, NEO is designed from first principles to be "born for multimodality". By fusing modalities at the level of the core architecture, it aims to improve performance, efficiency, and generality together. SenseTime presents this as redefining the performance boundary of multimodal models and as the start of a "native architecture" era for multimodal AI.

Breaking the bottleneck: from "patchwork" to "native"

Most mainstream multimodal models today follow the modular paradigm of "visual encoder + projector + language model". This approach extends a Large Language Model (LLM) so that it can accept image input, but it remains fundamentally language-centric, and the fusion of images and language happens only at the data level. Such a "patchwork" design learns inefficiently and limits the model in complex multimodal scenarios, such as those that require capturing fine image detail or understanding complex spatial structure. The NEO architecture was created to address this pain point.

In the second half of 2024, SenseTime was the first in China to achieve a breakthrough in native multimodal fusion training, topping both the SuperCLUE language evaluation and the OpenCompass multimodal evaluation with a single model. It built SenseNova 6.0 on this core technology, reaching leading multimodal reasoning ability. In July 2025 it released SenseNova 6.5, which introduced early fusion at the encoder level, tripled the cost-effectiveness of its multimodal models, and was the first in China to offer commercial-grade interleaved image-text reasoning. With NEO, SenseTime takes the next step: it abandons the traditional modular structure entirely and starts from first principles with a native architecture designed from scratch.

Three core innovations for a deep unification of vision and language

The NEO architecture is built around two core ideas, extreme efficiency and deep fusion. Through low-level innovations in three key dimensions (attention mechanism, positional encoding, and semantic mapping), the model natively handles vision and language in a unified way.

Native Patch Embedding: NEO abandons discrete image tokenizers and, through a dedicated Patch Embedding Layer (PEL), builds a continuous mapping from pixels to word tokens from the bottom up. This design captures image detail more finely and addresses the image-modeling bottleneck of mainstream models.
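To make the idea of a continuous pixel-to-embedding mapping concrete, here is a minimal sketch in plain Python. It is not NEO's actual PEL, whose internals the article does not describe. It assumes a grayscale image, non-overlapping square patches, and a single linear projection; the function names and the random stand-in weights are hypothetical.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return [
        [image[top + dy][left + dx] for dy in range(patch) for dx in range(patch)]
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def make_projection(in_dim, out_dim, seed=0):
    """Random stand-in for learned weights (hypothetical, not NEO's parameters)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def embed_patches(patches, weight):
    """Map each flattened patch to a continuous embedding via a linear layer."""
    return [[sum(w * x for w, x in zip(row, p)) for row in weight] for p in patches]


# A 4 x 4 toy image split into 2 x 2 patches, then projected to 3 dimensions.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch=2)
embeddings = embed_patches(patches, make_projection(in_dim=4, out_dim=3))
```

Unlike a discrete tokenizer, nothing here quantizes the pixels into a fixed vocabulary: each patch maps directly to a real-valued vector, so small pixel changes produce small embedding changes.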

Native RoPE: NEO decouples the frequency allocation across the three spatiotemporal dimensions, using high frequencies for the visual dimensions and low frequencies for the text dimension, to fit the natural structure of each modality. This lets NEO capture the spatial structure of images accurately and gives it the potential to extend to more complex settings such as video processing and cross-frame modeling.
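The decoupled frequency allocation can be sketched as rotary position embeddings in which each axis of a (T, H, W) position index gets its own block of channel pairs and its own frequency base. This is an illustrative sketch only: the channel split, the base values, and the convention that text tokens sit at H = W = 0 are assumptions, not details given in the article.

```python
import math


def axis_frequencies(num_pairs, base):
    """Standard RoPE inverse frequencies for one axis: base ** (-i / num_pairs)."""
    return [base ** (-i / num_pairs) for i in range(num_pairs)]


def rope_angles(position, pairs_per_axis, base_per_axis):
    """Concatenate rotation angles for each axis of a (T, H, W) position."""
    angles = []
    for index, pairs, base in zip(position, pairs_per_axis, base_per_axis):
        angles.extend(index * freq for freq in axis_frequencies(pairs, base))
    return angles


def rotate(vector, angles):
    """Rotate consecutive channel pairs of `vector` by the given angles."""
    out = []
    for k, angle in enumerate(angles):
        x, y = vector[2 * k], vector[2 * k + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out


# Assumed split: 4 channel pairs for T, 2 each for H and W (16 channels total).
PAIRS = (4, 2, 2)
# Assumed bases: a larger base for T yields lower frequencies than for H and W.
BASES = (1_000_000.0, 10_000.0, 10_000.0)

text_angles = rope_angles((5, 0, 0), PAIRS, BASES)   # text token: H = W = 0
patch_angles = rope_angles((6, 1, 2), PAIRS, BASES)  # image patch at row 1, col 2
```

Because each axis rotates its own channels, the dot product between a rotated query and a rotated key depends only on the per-axis offsets between their positions, which is what lets spatial and sequential distance be encoded independently.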

Native Multi-Head Attention: To match the characteristics of each modality, NEO applies autoregressive (causal) attention to text tokens and bidirectional attention to visual tokens within one unified framework. This greatly improves the model's use of spatial structure and better supports complex understanding and reasoning over mixed image-text input.
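One way to picture this mixed attention pattern is as a single boolean mask over an interleaved sequence. The sketch below is a simplified illustration, not NEO's implementation: it assumes that bidirectional attention is confined to tokens of the same image, and the `image_ids` encoding (None for text tokens, an integer per image) is hypothetical.

```python
def build_attention_mask(image_ids):
    """Return mask[q][k] == True where query token q may attend to key token k.

    image_ids[i] is None for a text token, or an integer identifying the
    image that visual token i belongs to.
    """
    n = len(image_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q  # autoregressive rule, applies to every token
            same_image = image_ids[q] is not None and image_ids[q] == image_ids[k]
            mask[q][k] = causal or same_image  # visual tokens also look ahead
    return mask


# Sequence layout: one text token, three patches of image 0, one text token.
mask = build_attention_mask([None, 0, 0, 0, None])
```

In this layout the first image patch can attend to the last patch of the same image, while no token can attend to text that comes after it, so text generation stays autoregressive.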

In addition, with its Pre-Buffer & Post-LLM two-stage fusion training strategy, NEO retains the full language reasoning ability of the original LLM while building strong visual perception from scratch, addressing the loss of language ability commonly seen in traditional cross-modal training.

Measured performance: flagship-level results with one tenth of the data
Driven by these architectural innovations, NEO shows notable data efficiency and performance advantages:

Very high data efficiency: NEO needs only 390 million image-text examples, about 1/10 of the data used by models of comparable performance, to develop top-tier visual perception. Without relying on massive data or an additional visual encoder, its compact architecture matches top modular flagship models such as Qwen2-VL and InternVL3 on multiple visual understanding tasks.

Strong, balanced performance: On public benchmarks including MMMU, MMB, MMStar, SEED-I, and POPE, NEO achieves high scores and outperforms other native VLMs overall, delivering what SenseTime calls a "lossless-accuracy" native architecture.

Inference cost-effectiveness: In the 0.6B-8B parameter range in particular, NEO has significant advantages for edge deployment. It improves both accuracy and efficiency while substantially reducing inference cost.

Open-source co-construction: building the next generation of AI infrastructure

Architecture is the "skeleton" of a model, and only a solid skeleton can support the future of multimodal technology.

NEO's early-fusion design supports arbitrary resolutions and long-image input, and extends to frontier areas such as video and embodied intelligence, achieving bottom-up, end-to-end fusion. On the application side, this end-to-end "native fusion" design provides technical support for diverse scenarios such as embodied robot interaction, multimodal responses on smart devices, video understanding, and 3D interaction.

SenseTime has open-sourced two models based on the NEO architecture, at 2B and 9B parameters, to encourage innovation and applications around native multimodal architectures in the open-source community. SenseTime said it is committed to making NEO a scalable, reusable next-generation AI infrastructure through open-source collaboration and real-world deployment, moving native multimodal technology from the laboratory to broad industrial use and accelerating the development of industrial-grade standards for next-generation native multimodal technology.
