Market Research Report

Product Code: 1660087

Research Report on AI Foundation Models and Their Applications in Automotive Field, 2024-2025

Publication date: | Publisher: ResearchInChina | English, 340 pages | Delivery: as fast as 1-2 business days


This report surveys China's automotive industry, providing an overview of AI foundation models, their types, common technologies, model companies, and application cases in vehicles.

Product Code: GX016

Research on AI foundation models and automotive applications: reasoning, cost reduction, and explainability

Reasoning capabilities drive up the performance of foundation models.

Since the second half of 2024, foundation model companies inside and outside China have launched reasoning models, using reasoning frameworks such as Chain-of-Thought (CoT) to enhance the ability of foundation models to handle complex tasks and make decisions independently.

The intensive releases of reasoning models aim to strengthen the ability of foundation models to handle complex scenarios and lay the foundation for Agent applications. In the automotive industry, improved reasoning capabilities can address pain points in AI applications, for example, enhancing a cockpit assistant's intent recognition in complex semantic contexts and improving the accuracy of spatiotemporal prediction in autonomous driving planning and decision-making.

In 2024, the reasoning technologies of mainstream foundation models introduced in vehicles revolved primarily around CoT and its variants (e.g., Tree-of-Thought (ToT), Graph-of-Thought (GoT), and Forest-of-Thought (FoT)), combined with generative models (e.g., diffusion models), knowledge graphs, causal reasoning models, cumulative reasoning, and multimodal reasoning chains in different scenarios.
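To make the CoT/ToT distinction concrete, below is a minimal, illustrative sketch in Python. The `call_model` function is a hypothetical stand-in for any hosted LLM API, and the propose/score callbacks are toy assumptions; none of this reflects a specific vendor's implementation.

```python
# Minimal sketch of CoT prompting vs. a Tree-of-Thought (ToT) search.
# `call_model` is a hypothetical stand-in for any LLM completion API, and the
# propose/score callbacks are toy assumptions, not a vendor's implementation.

from typing import Callable, List

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an HTTP request to a hosted model)."""
    return "...model output..."

def chain_of_thought(question: str) -> str:
    # CoT: elicit a single linear reasoning trace with the prompt itself.
    return call_model(f"{question}\nLet's think step by step.")

def tree_of_thought(question: str,
                    propose: Callable[[str], List[str]],
                    score: Callable[[str], float],
                    width: int = 3, depth: int = 2) -> str:
    # ToT: expand several candidate "thoughts" per step, keep the best `width`,
    # and repeat. In effect, a beam search over partial reasoning states.
    frontier = [question]
    for _ in range(depth):
        candidates = [t for state in frontier for t in propose(state)]
        frontier = sorted(candidates, key=score, reverse=True)[:width]
    return frontier[0]

# Toy usage: branch between two maneuver options, keep the higher-scoring path.
best = tree_of_thought("merge decision",
                       propose=lambda s: [s + " | yield", s + " | proceed"],
                       score=len)
```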

For example, the Modularized Thinking Language Model (MeTHanol) proposed by Geely enables a foundation model to synthesize human thoughts that supervise the hidden layers of an LLM, generating human-like thinking behavior and adapting to daily conversations and personalized prompts, thereby enhancing the thinking and reasoning capabilities of large language models and improving explainability.

In 2025, the focus of reasoning technology will shift to multimodal reasoning. Common training techniques include instruction fine-tuning, multimodal in-context learning, and multimodal CoT (M-CoT), often enabled by combining multimodal fusion alignment with LLM reasoning technologies.
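As a concrete illustration of the M-CoT pattern, here is a minimal sketch that assembles a "rationale before answer" multimodal request. The message schema and field names are assumptions modeled on typical multimodal chat APIs, not any specific vendor's format.

```python
# Minimal sketch of a multimodal CoT (M-CoT) request in the common
# "rationale before answer" pattern. The schema and field names are
# illustrative assumptions, not a specific vendor's API.

import base64

def mcot_messages(image_path: str, question: str) -> list:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    return [
        {"role": "system",
         "content": ("First describe the relevant visual evidence, "
                     "then reason step by step, then give the final answer.")},
        {"role": "user", "content": [
            {"type": "image", "data": image_b64},   # visual modality
            {"type": "text", "data": question},     # language modality
        ]},
    ]
```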

Explainability builds trust between AI and users.

Before users can experience the "usefulness" of AI, they must trust it. In 2025, the explainability of AI systems therefore becomes a key factor in growing the user base of automotive AI. This challenge can be addressed by displaying a long CoT.

The explainability of AI systems can be achieved at three levels: data explainability, model explainability, and post-hoc explainability.
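Of these three levels, post-hoc explainability is the easiest to illustrate in isolation. The sketch below computes input-gradient saliency for a toy network; the model and input are placeholders, and this is only one of many post-hoc techniques.

```python
# Post-hoc explainability sketch: input-gradient saliency on a toy model.
# The network and input are placeholders; the idea is that gradients of the
# output with respect to the input rank which features drove the decision.

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(8, 4), torch.nn.ReLU(), torch.nn.Linear(4, 1))
x = torch.randn(1, 8, requires_grad=True)   # e.g., 8 perception features

model(x).sum().backward()
saliency = x.grad.abs().squeeze()           # per-feature influence scores
print(saliency)
```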

In Li Auto's case, its L3 autonomous driving uses "AI reasoning visualization technology" to intuitively present the thinking process of the end-to-end + VLM models, covering the entire process from perception input from the physical world to the driving decisions output by the foundation model, enhancing users' trust in intelligent driving systems.

In Li Auto's "AI reasoning visualization technology":

The attention system displays the traffic and environmental information perceived by the vehicle, evaluates the behavior of traffic participants in real-time video streams, and uses heatmaps to highlight the evaluated objects.

The end-to-end (E2E) model displays the thinking process behind its driving trajectory output. The model considers different driving trajectories, presents 10 candidate outputs, and finally adopts the most likely one as the driving path (a minimal sketch of this selection step appears below).

The vision language model (VLM) displays its perception, reasoning, and decision-making process through dialogue.
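Below is a minimal sketch of the candidate-trajectory selection step described above: a planning head proposes N scored trajectories, the most likely one is adopted, and the full candidate set is retained for visualization. Array shapes, names, and the softmax scoring are illustrative assumptions, not Li Auto's actual interfaces.

```python
# Sketch of candidate-trajectory selection: a planning head proposes N scored
# trajectories, the most likely one is adopted, and all candidates plus their
# weights are kept for visualization. Shapes and scoring are assumptions.

import numpy as np

def select_trajectory(candidates: np.ndarray, scores: np.ndarray):
    """candidates: (N, T, 2) array of N trajectories with T (x, y) waypoints.
    scores: (N,) unnormalized likelihoods from the planning head."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                    # softmax over candidates
    best = int(np.argmax(probs))
    return candidates[best], probs          # chosen path + weights to display

rng = np.random.default_rng(0)
cands = rng.normal(size=(10, 20, 2)).cumsum(axis=1)   # 10 candidate paths
path, weights = select_trajectory(cands, rng.normal(size=10))
```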

The dialogue interfaces of various reasoning models likewise use a long CoT to break down the reasoning process. DeepSeek R1, for example, first presents the decision made at each node through a CoT during conversations with users, and then provides an explanation in natural language.

Additionally, most reasoning models, including Zhipu's GLM-Zero-Preview, Alibaba's QwQ-32B-Preview, and Skywork 4.0 o1, can display their long CoT reasoning process.
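As a concrete example of surfacing such a trace, the sketch below splits an R1-style response into its chain of thought and final answer. DeepSeek R1 wraps its reasoning in <think> tags; assuming the same convention for other models is a simplification made here.

```python
# Sketch of splitting an R1-style response into its long CoT and final answer.
# DeepSeek R1 wraps its reasoning in <think> tags; treating other models the
# same way is a simplifying assumption.

import re

def split_cot(response: str) -> tuple:
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match:
        cot = match.group(1).strip()              # the long chain of thought
        answer = response[match.end():].strip()   # the natural-language answer
        return cot, answer
    return "", response.strip()

cot, answer = split_cot(
    "<think>Step 1: check the lane. Step 2: yield to the bus.</think>"
    "Merge after the bus passes.")
```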

DeepSeek lowers the barrier to introducing foundation models into vehicles, enabling both performance improvement and cost reduction.

Does the improvement in reasoning capabilities and overall performance mean higher costs? Not necessarily, as DeepSeek's popularity shows. In early 2025, OEMs began connecting to DeepSeek, primarily to enhance the comprehensive capabilities of their vehicle foundation models in specific applications.

In fact, before DeepSeek's models were launched, OEMs had already been developing and iterating their own automotive AI foundation models. In the case of cockpit assistants, some had completed the initial construction of their cockpit assistant solutions and connected to cloud foundation model suppliers for trial operation, or had tentatively selected suppliers, including cloud service providers such as Alibaba Cloud, Tencent Cloud, and Zhipu. They connected to DeepSeek in early 2025, valuing the following:

Strong reasoning performance: the R1 reasoning model, for example, is comparable to OpenAI o1 and even excels in mathematical logic.

Lower costs: performance is maintained while training and inference costs are kept at industry-low levels.

By connecting to DeepSeek, OEMs can genuinely reduce the costs of hardware procurement, model training, and maintenance while maintaining performance when deploying intelligent driving and cockpit assistants:

Low computing overhead technologies facilitate high-level autonomous driving and technological equality: high-performance models can be deployed on low-compute automotive chips (e.g., edge computing units), reducing reliance on expensive GPUs. Combined with the DualPipe algorithm and FP8 mixed-precision training, these technologies optimize computing power utilization, allowing mid- and low-end vehicles to deploy high-level cockpit and autonomous driving features and accelerating the popularization of intelligent cockpits.
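To show the storage/precision trade-off at the heart of FP8 mixed precision, here is a minimal sketch (assuming PyTorch 2.1+ for the float8_e4m3fn dtype). Real FP8 training pipelines such as DeepSeek's use fused kernels and finer-grained scaling; this captures only the core idea.

```python
# FP8 mixed-precision sketch: store weights in 8-bit floating point with a
# per-tensor scale, and dequantize to a wider dtype for the matmul. Production
# FP8 uses fused kernels and per-block scaling; this is the idea in miniature.

import torch

def quantize_fp8(w: torch.Tensor):
    scale = w.abs().max() / 448.0                    # 448 = max normal e4m3 value
    return (w / scale).to(torch.float8_e4m3fn), scale

def fp8_linear(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor):
    return x @ (w_fp8.to(torch.float32) * scale).T   # dequantize, then matmul

w = torch.randn(256, 256)
w_fp8, scale = quantize_fp8(w)
x = torch.randn(8, 256)
print((fp8_linear(x, w_fp8, scale) - x @ w.T).abs().mean())  # small quant error
```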

Enhanced real-time performance. In driving environments, autonomous driving systems must process large amounts of sensor data in real time, and cockpit assistants must respond quickly to user commands, while vehicle computing resources are limited. With lower computing overhead, DeepSeek enables faster processing of sensor data, more efficient use of the computing power of intelligent driving chips (DeepSeek achieves 90% utilization of NVIDIA A100 chips during server-side training), and lower latency (e.g., on the Qualcomm 8650 platform, with 100 TOPS of computing power, DeepSeek reduces the inference response time from 20 milliseconds to 9-10 milliseconds). In intelligent driving systems, this helps ensure that driving decisions are timely and accurate, improving driving safety and user experience. In cockpit systems, it helps cockpit assistants respond quickly to user voice commands, achieving smooth human-computer interaction.
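As a quick sanity check on the latency figures quoted above (taken from the text, not independently measured), halving the per-inference time roughly doubles the decision rate available to the planner:

```python
# Illustrative arithmetic on the quoted latency figures: lower per-inference
# latency raises the maximum achievable decision rate.
for latency_ms in (20.0, 9.5):   # before vs. after (midpoint of the 9-10 ms range)
    print(f"{latency_ms:>4} ms/inference -> {1000 / latency_ms:.0f} Hz max decision rate")
```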

Table of Contents

Definitions

1 Overview of AI Foundation Models

  • 1.1 Introduction to AI Models
  • Definition and Features of AI Models
  • Classification of AI Models by Architecture
  • Classification of AI Models by Task Type/Training Method
  • Classification of AI Models by Supervision Mode
  • Classification of AI Models by Modality
  • Application Process of AI Models
  • 1.2 Introduction to Foundation Models
  • Classification of Foundation Models
  • Current Development of Foundation Models in Automotive Industry
  • Application Scenarios of Foundation Models in Automotive Industry
  • Application Case 1: Application of LLM in Autonomous Driving
  • Application Case 2: Application of VFM in Autonomous Driving
  • Application Case 3: Application of MFM in Autonomous Driving

2 Analysis of AI Foundation Models of Differing Types

  • 2.1 Large Language Models (LLM)
  • Development History of LLM
  • Key Capabilities of LLM
  • Cases of Integration with Other Models
  • 2.2 Multimodal Large Language Models (MLLM)
  • Development and Overview of Large Multimodal Models
  • Large Multimodal Models vs. Large Single-modal Models (1)
  • Large Multimodal Models vs. Large Single-modal Models (2)
  • Technology Panorama of Large Multimodal Models
  • Multimodal Information Representation
  • Multimodal Large Language Models (MLLM)
  • Architecture and Core Components of MLLM
  • Status Quo of MLLM
  • Dataset Evaluation by Different MLLM Representatives
  • Reasoning Capabilities of MLLM
  • Synergy between MLLM and Agent
  • Application Case 1: Application of MLLM in VQA
  • Application Case 2: Application of MLLM in Autonomous Driving
  • 2.3 Vision-Language Models (VLM) and Vision-Language-Action (VLA) Models
  • Development History of VLM
  • Application of VLM
  • Architecture of VLM
  • Evolution of VLM in Intelligent Driving
  • Application Scenarios of VLM: End-to-end Autonomous Driving
  • Application Scenarios of VLM: Combination with Gaussian Framework
  • VLM->VLA
  • VLA Models
  • Principles of VLA
  • Classification of VLA Models
  • Application Cases of VLA (1)
  • Application Cases of VLA (2)
  • Application Cases of VLA (3)
  • Application Cases of VLA (4)
  • Case 1: Core Functions of End-to-End Multimodal Model for Autonomous Driving (EMMA)
  • Case 2: World Model Construction
  • Case 3: Improve Vision-Language Navigation Capabilities
  • Case 4: VLA Generalization Enhancement
  • Case 5: Computing Overhead of VLA
  • 2.4 World Models
  • Key Definitions of World Models and Application Development
  • Basic Architecture of World Models
  • Framework Setup and Implementation Challenges of World Models
  • Video Generation Methods Based on Transformer and Diffusion Models
  • Technical Principle and Path of WorldDreamer
  • World Models and End-to-end Intelligent Driving
  • World Models and End-to-end Intelligent Driving: Data Generation
  • Case 1: Tesla World Model
  • Case 2: NVIDIA
  • Case 3: InfinityDrive
  • Case 4: World Labs Spatial Intelligence
  • Case 5: NIO
  • Case 6: 1X's "World Model"

3 Common Technologies in AI Foundation Models

  • Common Foundation Model Algorithms and Architectures
  • Comparison of Features and Application Scenarios between Foundation Model Algorithms
  • 3.1 Foundation Model Architectures and Related Algorithms
  • Transformer: Architecture and Features
  • Transformer: Algorithm Mechanisms
  • Transformer: Multi-head Attention Mechanisms and Their Variants
  • KAN: Potential to Replace MLP
  • KAN: Cases of Integration with Transformer Architecture
  • MAMBA: Introduction
  • MAMBA: Architectural Foundations
  • MAMBA: Latest Developments
  • MAMBA: Application Scenarios
  • MAMBA: Cases of Integration with Transformer Architecture
  • Applicability of CNN in the Era of Foundation Models
  • Applicability of RNN Variants in the Era of Foundation Models
  • 3.2 Visual Processing Algorithms
  • Common Vision Algorithms
  • ViT
  • CLIP Scenarios and Features
  • CLIP Workflow
  • LLaVA Model
  • 3.3 Training and Fine-Tuning Technologies
  • Foundation Model Training Process
  • Training Case: Geely's CPT Enhancement Solution
  • Instruction Fine-tuning
  • Training Case: Geely's Fine-tuning Framework for Multi-round Dialogues
  • 3.4 Reinforcement Learning
  • Introduction to Reinforcement Learning
  • Reinforcement Learning Process
  • Comparison between Some Reinforcement Learning Technology Routes
  • Cases of Reinforcement Learning (1)-(3)
  • 3.5 Knowledge Graphs
  • Optimization Directions for Retrieval-Augmented Generation (RAG)
  • Evolution Directions of RAG (1): KAG
  • Evolution Directions of RAG (2): CAG
  • Evolution Directions of RAG (3): GraphRAG
  • RAG Application Case 1
  • RAG Application Case 2
  • RAG Application Case 3: Li Auto
  • RAG Application Case 4: Geely
  • Comparison between RAG Routes
  • Function Call
  • 3.6 Reasoning Technologies
  • Reasoning Process of Transformer Models
  • Evaluation of Reasoning Capabilities
  • Three Optimization Directions for Foundation Model Reasoning
  • Reasoning Task Types (1)
  • Reasoning Task Types (2)
  • Reasoning Task Types (3)
  • Common Reasoning Algorithm 1: CoT
  • Common Reasoning Algorithm 2: GoT/ToT
  • Comparison between Common Reasoning Algorithms
  • Common Reasoning Algorithm 3: PagedAttention
  • Reasoning Case 1: Geely
  • Reasoning Case 2: NVIDIA
  • 3.7 Sparsification
  • Characteristics of MoE Architecture
  • Principles of MoE Architecture
  • MoE Training Strategies
  • Advantages and Challenges of MoE
  • MoE Models from Different Foundation Model Companies
  • Evolution Direction of MoE
  • 3.8 Generation Technologies
  • Introduction to Generative Models
  • Comparison between Generation Technologies
  • Case 1: Li Auto
  • Case 2: XPeng
  • Case 3: SAIC

4 AI Foundation Model Companies

  • Development History of Mainstream Foundation Models
  • Mainstream Foundation Models and Their Companies (Foreign)
  • Mainstream Foundation Models and Their Companies (Chinese)
  • Rankings of Evaluated Foundation Models
  • 4.1 OpenAI
  • Product Layout
  • Product Iteration History
  • GPT Series: Features
  • GPT Series: Architecture
  • From GPT-4V to 4o
  • Reasoning Model OpenAI o1
  • SORA: Features
  • SORA: Performance Evaluation
  • SORA: Advantages and Limitations
  • 4.2 Google
  • Development History of Foundation Models
  • Typical Model BERT: Architecture
  • Typical Model BERT: Variants
  • Gemini Model
  • Cases of Foundation Models in the Automotive Industry
  • 4.3 Meta
  • LLAMA 3.3
  • LLAMA Series: Evolution
  • LLAMA Series: Features
  • LLAMA Series: Training Methods
  • LLAMA Series: Alpaca
  • LLAMA Series: Vicuna
  • 4.4 Anthropic
  • Claude Performance Evaluation
  • Claude-based PC-side Agent
  • 4.5 Mistral AI
  • Expert Model: Architecture
  • Expert Model: Algorithm Features (1)
  • Expert Model: Algorithm Features (2)
  • Large Language Model: Mistral Large 2
  • 4.6 Amazon
  • Nova Product System
  • Application Cases of Amazon AI Cloud in the Automotive Industry (1)-(3)
  • 4.7 Stability AI
  • Product System
  • Stable Diffusion Architecture Based on Diffusion Models
  • Comparison of Stable Diffusion's Video Generation Technology with Competitors
  • 4.8 xAI
  • Product System
  • Capabilities of xAI Models
  • Capabilities of Grok-2
  • Capabilities of Grok-0/1
  • 4.9 Abu Dhabi Technology Innovation Institute
  • Iteration History of Falcon Model Series
  • Parameters of Falcon 3 Series
  • Evaluation of Falcon 3 Series
  • 4.10 SenseTime
  • Major Foundation Model Product Systems (1)
  • Major Foundation Model Product Systems (2)
  • Foundation Model Training Facilities
  • Functional Scenarios of Foundation Models
  • Foundation Model Technologies
  • 4.11 Alibaba Cloud
  • Foundation Model Product System
  • End-cloud Integration Solutions of Foundation Models
  • 4.12 Baidu AI Cloud
  • Foundation Model Product System
  • 4.13 Tencent Cloud
  • Foundation Model Product System
  • Reasoning Service Solutions (1)-(3)
  • Generation Scenario Solutions for Foundation Models
  • Q&A Scenario Solutions for Foundation Models
  • 4.14 ByteDance & Volcano Engine
  • Doubao Model System
  • Functional Highlights of Volcano Engine's Cockpit
  • 4.15 Huawei
  • Pangu Model Product System
  • Application Cases of Pangu Models in Data Synthesis
  • LLM Architecture of Pangu Models
  • Capabilities of Pangu Models: Multimodal Technology
  • Capabilities of Pangu Models: Thinking & Reasoning Technology
  • AI Cloud Services of Pangu Models
  • 4.16 Zhipu AI
  • Product System
  • Foundation Model Base in the Automotive Industry
  • Technical Features
  • 4.17 iFLYTEK
  • Product System
  • Functional and Technical Highlights
  • Cockpit AI System
  • 4.18 DeepSeek
  • Product System
  • Technical Inspiration from DeepSeek V3
  • Technical Highlights of DeepSeek R1
  • Application Cases of DeepSeek (1)-(3)

5 Application Cases of AI Foundation Models in Automotive

  • 5.1 Cockpit Cases
  • Lenovo's AI Vehicle Computing Framework Used in Cockpits
  • In-cabin Functions of Thundersoft's Rubik Foundation Model
  • LLM Empowers Smart Eye's DMS/OMS Assistance System
  • Application of DIT in Voice Processing Scenarios
  • Application of Unisound's Shanhai Model in Cockpits
  • Phoenix Auto Intelligence's Cockpit Smart Brain
  • 5.2 Intelligent Driving Cases
  • Li Auto: Multimodal Technology in Autonomous Driving (1)
  • Li Auto: Multimodal Technology in Autonomous Driving (2)
  • Li Auto: Multimodal Technology in Autonomous Driving (3): Overcoming 2D Limitations
  • Li Auto: Data Generation Technology (1)
  • Li Auto: Data Generation Technology (2)
  • Li Auto: CoT Technology in DriveVLM
  • Li Auto: Application of Visual Processing
  • Li Auto: Data Selection
  • Geely: Application of Visual Processing
  • Geely: Multimodal Learning Framework
  • Wayve: Generative World Model GAIA-1
  • Tesla: Algorithm Architecture (Including NeRF)
  • Tesla: Backbone, Neck, and Head of Vision Algorithms
  • Tesla: Core of Visual System - HydraNet
  • Giga's World Model

6 Application Trends of AI Foundation Models

  • 6.1 Data
  • Trend 1
  • Trend 2
  • 6.2 Algorithm
  • Trend 1
  • Trend 2
  • Trend 3
  • Trend 4
  • 6.3 Computing Power
  • Trend 1
  • Trend 2
  • 6.4 Engineering
  • Trend 1
  • Trend 2