Market Research Report
Product Code
1482384
End-to-end Autonomous Driving (E2E AD) Industry in China, 2024 | End-to-end Autonomous Driving (E2E AD) Research Report, 2024
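An end-to-end autonomous driving system maps sensor data inputs (camera images, LiDAR, etc.) directly to control command outputs (steering, acceleration, deceleration, etc.). The approach first appeared in the ALVINN project in 1988, which used a simple neural network that took a camera and a laser rangefinder as inputs and produced steering as output.
In early 2024, Tesla released FSD V12.3 and highlighted its remarkable level of intelligent driving. This end-to-end autonomous driving solution has drawn widespread attention from OEMs and autonomous driving solution companies in China.
Compared with conventional multi-module solutions, an end-to-end autonomous driving solution integrates perception, prediction and planning into a single model, which simplifies the solution structure. It can imitate a human driver making driving decisions directly from visual inputs, cope more effectively with the long-tail scenarios that challenge modular solutions, and improve model training efficiency and performance.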
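Li Auto's end-to-end solution
Li Auto believes that a complete end-to-end model should cover the whole process of perception, tracking, prediction, decision and planning, and that it is the optimal solution for achieving L3 autonomous driving. In 2023, Li Auto released AD Max 3.0, whose overall framework reflects the end-to-end concept but still falls short of a complete end-to-end solution. In 2024, Li Auto is expected to upgrade the system into a complete end-to-end solution.
Li Auto's autonomous driving framework consists of two systems:
Fast system: System 1, Li Auto's existing end-to-end solution, which acts directly on what it perceives of the surroundings.
Slow system: System 2, a multimodal large language model that reasons logically and explores unfamiliar environments to solve problems in unknown L4 scenarios.
In the process of promoting the end-to-end solution, Li Auto plans to build an end-to-end Temporal Planner on its existing foundation, unify the planning/prediction model with the perception model, and integrate parking with driving.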
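Implementing an end-to-end solution involves R&D team building, hardware facilities, data collection and processing, algorithm training and strategy customization, verification and evaluation, promotion, and mass production.
Integrated training of end-to-end autonomous driving solutions requires massive amounts of data, so data collection and processing is a key challenge.
First, collecting the data takes a long time and many channels. The data includes driving data and scenario data such as roads, weather and traffic conditions. In real driving, data within the driver's forward field of view is relatively easy to collect, but information about the surroundings is hard to collect.
During data processing, it is necessary to design data extraction dimensions, extract effective features from massive amounts of video, and compute data distribution statistics to support large-scale training.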
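In addition to autonomous vehicles, embodied robots are another mainstream scenario for end-to-end solutions. From end-to-end autonomous driving to robots, a more universal world model is needed to adapt to more complex and diverse real-world application scenarios. The mainstream AGI (Artificial General Intelligence) development framework can be divided into two stages:
Stage 1: the understanding and generation capabilities of foundation models are unified and then combined with embodied AI to form a unified world model.
Stage 2: the world model, together with complex task planning and control and abstract concept induction, gradually evolves into the era of interactive AGI 1.0.
This report investigates and analyzes the end-to-end autonomous driving (E2E AD) industry in China, summarizing the current status of autonomous driving, development trends, application cases and related information.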
End-to-end Autonomous Driving Research: status quo of End-to-end (E2E) autonomous driving
An end-to-end autonomous driving system maps sensor data inputs (camera images, LiDAR, etc.) directly to control command outputs (steering, acceleration/deceleration, etc.). The approach first appeared in the ALVINN project in 1988, which used a camera and a laser rangefinder as inputs and a simple neural network to generate steering as output.
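The direct sensor-to-steering mapping described above can be illustrated with a minimal sketch. It is not the ALVINN implementation: the layer sizes, the random untrained weights, the tanh activation and the names `make_layer`, `forward` and `steering_from_sensors` are all assumptions chosen for illustration. The only idea taken from the text is that camera and rangefinder readings go in and a steering command comes out, with no hand-written perception or planning stage in between.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights for one fully connected layer (untrained, illustrative)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def forward(layer, inputs):
    """Weighted sum followed by tanh for each output unit."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in layer]


def steering_from_sensors(camera, rangefinder, hidden_layer, output_layer):
    """Map raw sensor values directly to a steering command in [-1, 1]."""
    features = list(camera) + list(rangefinder)   # concatenate both sensors
    hidden = forward(hidden_layer, features)
    bins = forward(output_layer, hidden)          # one unit per steering bin
    best = max(range(len(bins)), key=bins.__getitem__)
    # Leftmost bin maps to -1 (full left), rightmost bin to +1 (full right).
    return 2.0 * best / (len(bins) - 1) - 1.0


# Hypothetical sizes: an 8x8 camera patch, 16 range readings, 5 hidden units, 9 bins.
rng = random.Random(0)
hidden_layer = make_layer(64 + 16, 5, rng)
output_layer = make_layer(5, 9, rng)
camera = [rng.random() for _ in range(64)]
rangefinder = [rng.random() for _ in range(16)]
print(steering_from_sensors(camera, rangefinder, hidden_layer, output_layer))
```

With untrained weights the printed steering value is meaningless; in practice the weights would be learned from recorded driving so that the winning bin matches the human driver's steering.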
In early 2024, Tesla rolled out FSD V12.3 and highlighted its remarkable level of intelligent driving. This end-to-end autonomous driving solution has drawn widespread attention from OEMs and autonomous driving solution companies in China.
Compared with conventional multi-module solutions, an end-to-end autonomous driving solution integrates perception, prediction and planning into a single model, which simplifies the solution structure. It can imitate a human driver making driving decisions directly from visual inputs, cope more effectively with the long-tail scenarios that challenge modular solutions, and improve model training efficiency and performance.
Li Auto's end-to-end solution
Li Auto believes that a complete end-to-end model should cover the whole process of perception, tracking, prediction, decision and planning, and that it is the optimal solution for achieving L3 autonomous driving. In 2023, Li Auto released AD Max 3.0, whose overall framework reflects the end-to-end concept but still falls short of a complete end-to-end solution. In 2024, Li Auto is expected to upgrade the system into a complete end-to-end solution.
Li Auto's autonomous driving framework is shown below, consisting of two systems:
Fast system: System 1, Li Auto's existing end-to-end solution, which acts directly on what it perceives of the surroundings.
Slow system: System 2, a multimodal large language model that reasons logically and explores unfamiliar environments to solve problems in unknown L4 scenarios.
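One way to picture how a fast and a slow system could cooperate is the sketch below. The report does not say how Li Auto arbitrates between the two systems, so the familiarity score, the threshold and the names `Plan`, `fast_system`, `slow_system` and `drive_step` are hypothetical. The slow system here is a fixed cautious rule standing in for a multimodal large language model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Plan:
    target_speed_mps: float
    note: str
    source: str  # "fast" or "slow"


def fast_system(scene: dict) -> Tuple[Plan, float]:
    """Stand-in for System 1: react directly to the perceived scene.

    Returns a plan plus a hypothetical familiarity score in [0, 1].
    """
    familiarity = scene.get("familiarity", 1.0)
    speed = min(scene.get("speed_limit_mps", 13.9),
                scene.get("lead_speed_mps", float("inf")))
    return Plan(speed, "follow lane", "fast"), familiarity


def slow_system(scene: dict) -> Plan:
    """Stand-in for System 2: a deliberate step for unfamiliar scenes.

    A real system would query a multimodal LLM; this uses a fixed cautious rule.
    """
    speed = min(scene.get("speed_limit_mps", 13.9), 5.0)
    return Plan(speed, "unfamiliar scene: slow down and reassess", "slow")


def drive_step(scene: dict, familiarity_threshold: float = 0.5) -> Plan:
    """Use the fast plan unless the scene looks unfamiliar (assumed policy)."""
    plan, familiarity = fast_system(scene)
    if familiarity < familiarity_threshold:
        return slow_system(scene)
    return plan


print(drive_step({"familiarity": 0.9, "speed_limit_mps": 10.0}))
print(drive_step({"familiarity": 0.2, "speed_limit_mps": 10.0}))
```

The design choice illustrated is only the division of labour: the fast path runs on every frame, and the slower reasoning path is consulted when the scene falls outside what the fast path handles well.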
In the process of promoting the end-to-end solution, Li Auto plans to unify the planning/prediction model and the perception model, and to build the end-to-end Temporal Planner on its existing foundation so that parking and driving are integrated.
Implementing an end-to-end solution involves R&D team building, hardware facilities, data collection and processing, algorithm training and strategy customization, verification and evaluation, promotion, and mass production. Some of the pain points in these scenarios are shown in the table:
Integrated training of end-to-end autonomous driving solutions requires massive amounts of data, so one of the main difficulties lies in data collection and processing.
First, collecting the data takes a long time and many channels. The data includes driving data and scenario data such as roads, weather and traffic conditions. In actual driving, data within the driver's forward view is relatively easy to collect, but information about the surroundings is hard to collect.
During data processing, it is necessary to design data extraction dimensions, extract effective features from massive numbers of video clips, and compute data distribution statistics to support large-scale training.
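The data distribution statistics mentioned above can be sketched with a simple tally over clip metadata. The record fields (`weather`, `road`), the 15% threshold and the helper names `scenario_distribution` and `underrepresented` are assumptions for illustration; the report does not describe any vendor's actual pipeline.

```python
from collections import Counter


def scenario_distribution(clips, key):
    """Share of clips for each value of one extraction dimension."""
    counts = Counter(clip[key] for clip in clips)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def underrepresented(distribution, min_share):
    """Values whose share falls below min_share, sorted by name."""
    return sorted(v for v, share in distribution.items() if share < min_share)


# Hypothetical clip metadata; real records would come from the collection fleet.
clips = (
    [{"weather": "sunny", "road": "highway"}] * 8
    + [{"weather": "rain", "road": "urban"}]
    + [{"weather": "snow", "road": "urban"}]
)
weather = scenario_distribution(clips, "weather")
print(weather)
print(underrepresented(weather, min_share=0.15))
```

Statistics like these indicate which scenarios are scarce in the training set, which is where long-tail data collection effort would presumably be directed.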
DeepRoute
As of March 2024, DeepRoute.ai's end-to-end autonomous driving solution has been designated by Great Wall Motor and is part of the company's cooperation with NVIDIA. It is expected to be adapted to NVIDIA Thor in 2025. In DeepRoute.ai's plan, the transition from the conventional solution to the "end-to-end" autonomous driving solution goes through three steps: sensor pre-fusion, HD map removal, and the integration of perception, decision and control.
GigaStudio
DriveDreamer, an autonomous driving model from GigaStudio, is capable of scenario generation, data generation, driving action prediction and so forth. Its scenario/data generation has two steps:
First, single-frame structural conditions guide DriveDreamer to generate driving scenario images, which helps it understand structural traffic constraints.
Second, this understanding is extended to video generation. Using continuous traffic structure conditions, DriveDreamer outputs driving scene videos, which further strengthens its understanding of motion transitions.
In addition to autonomous vehicles, embodied robots are another mainstream scenario for end-to-end solutions. From end-to-end autonomous driving to robots, a more universal world model is needed to adapt to more complex and diverse real-world application scenarios. The mainstream AGI (Artificial General Intelligence) development framework is divided into two stages:
Stage 1: the understanding and generation capabilities of foundation models are unified and then combined with embodied artificial intelligence (embodied AI) to form a unified world model;
Stage 2: the world model, together with complex task planning and control and abstract concept induction, gradually evolves into the era of interactive AGI 1.0.
In deploying the world model, building an end-to-end VLA (Vision-Language-Action) autonomous system has become a crucial link. VLA, as the foundation model of embodied AI, can seamlessly link 3D perception, reasoning and action to form a generative world model. It is built on a 3D-based large language model (LLM) and introduces a set of interaction tokens for interacting with the environment.
As of April 2024, some manufacturers of humanoid robots adopting end-to-end solutions are as follows:
For example, Udeer*AI's Large Physical Language Model (LPLM) is an end-to-end embodied AI solution. It uses a self-labeling mechanism to improve the efficiency and quality of learning from unlabeled data, which deepens the model's understanding of the world and enhances the robot's generalization and environmental adaptability across modalities, scenes and industries.
LPLM abstracts the physical world and ensures that this information is aligned with the abstraction level of the features in the LLM. It explicitly models each entity in the physical world as a token that encodes geometric, semantic, kinematic and intentional information.
In addition, LPLM adds 3D grounding to the encoding of natural language instructions, which improves, to some extent, how accurately those instructions are interpreted. Its decoder learns by constantly predicting the future, which strengthens the model's ability to learn from massive unlabeled data.
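The idea of modeling each physical entity as a token that carries geometric, semantic, kinematic and intentional information can be sketched as a small data structure. This is not LPLM's actual encoding, which the report does not specify: the field names, the class and intent vocabularies, and the flat vector layout in `EntityToken.to_vector` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical vocabularies, chosen only for illustration.
SEMANTIC_CLASSES = ("pedestrian", "vehicle", "obstacle")
INTENTS = ("unknown", "crossing", "waiting", "yielding")


def one_hot(value: str, vocab: Tuple[str, ...]) -> List[float]:
    """1.0 at the position of value in vocab; all zeros if value is absent."""
    return [1.0 if value == item else 0.0 for item in vocab]


@dataclass(frozen=True)
class EntityToken:
    position_m: Tuple[float, float, float]    # geometric: where the entity is
    size_m: Tuple[float, float, float]        # geometric: bounding box extents
    semantic_class: str                       # semantic: what the entity is
    velocity_mps: Tuple[float, float, float]  # kinematic: how it is moving
    intent: str                               # intentional: what it is likely to do

    def to_vector(self) -> List[float]:
        """Flatten all four kinds of information into one feature vector."""
        return (
            list(self.position_m)
            + list(self.size_m)
            + one_hot(self.semantic_class, SEMANTIC_CLASSES)
            + list(self.velocity_mps)
            + one_hot(self.intent, INTENTS)
        )


token = EntityToken(
    position_m=(2.0, 0.5, 0.0),
    size_m=(0.5, 0.5, 1.7),
    semantic_class="pedestrian",
    velocity_mps=(0.0, 1.2, 0.0),
    intent="crossing",
)
print(len(token.to_vector()), token.to_vector())
```

In a learned model, a vector like this would presumably be projected into the LLM's embedding space so that entity tokens and language tokens share one representation, in line with the alignment the text describes.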