|
Resource Integration
The platform integrates GPU, CPU, storage, and network resources into a heterogeneous GPU computing resource pool, backed by a high-speed parallel storage system and a high-speed InfiniBand network. The pool supports flexible cross-region allocation and provides an efficient, low-latency network environment dedicated to AI workloads. It accommodates the storage requirements of model parameters, supports efficient access to massive datasets and multi-server communication, speeds up data transfer, and keeps data transmission and task execution secure and stable.
|
|
Distributed Scheduling
Built on container technology, the platform schedules multiple resource types and automatically allocates and manages GPU computing resources, improving the efficiency of resource and task scheduling. It also provides resource group and priority configuration, which shortens the scheduling path for data transmission to meet the training and inference demands of large language models. In addition, the platform supports continuous operation of model fine-tuning and inference services.
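To illustrate how resource groups and priorities could interact, here is a minimal sketch in Python. It is not the platform's scheduler: the `schedule` function, the job tuples, and the group names `train` and `serve` are all hypothetical, and the policy (highest priority first, submission order on ties, no preemption) is an assumption made for the example.

```python
import heapq
from itertools import count


def schedule(jobs, free_gpus):
    """Place jobs on resource groups in priority order (illustrative only).

    jobs: list of (name, group, gpus_needed, priority); higher priority first.
    free_gpus: dict mapping a resource-group name to its free GPU count.
    Returns (placed, pending) as lists of job names.
    """
    order = count()  # tie-breaker: equal priorities keep submission order
    heap = [(-priority, next(order), name, group, need)
            for name, group, need, priority in jobs]
    heapq.heapify(heap)

    remaining = dict(free_gpus)  # work on a copy, leave the caller's dict alone
    placed, pending = [], []
    while heap:
        _, _, name, group, need = heapq.heappop(heap)
        if remaining.get(group, 0) >= need:
            remaining[group] -= need
            placed.append(name)
        else:
            pending.append(name)  # not enough free GPUs in this group
    return placed, pending


# Hypothetical workload: two groups, four jobs with different priorities.
jobs = [
    ("finetune-a", "train", 4, 5),
    ("infer-b", "serve", 1, 9),
    ("pretrain-c", "train", 8, 7),
    ("infer-d", "serve", 2, 1),
]
placed, pending = schedule(jobs, {"train": 8, "serve": 2})
print(placed, pending)
```

Because each job draws only from its own group, a large training job cannot starve the inference group, while the priority decides which job within a group gets the GPUs first.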
|
|
Heterogeneous Support
The platform provides unified management of heterogeneous computing resources, integrating mainstream domestic and overseas accelerators such as GPUs, TPUs, and DPUs into a centralized computing resource pool. The system schedules and allocates computing resources according to the demands of each computing task and offers multiple deployment options: computing resource groups, whole-server rental, and card-level requests.
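The difference between whole-server and card-level allocation can be sketched as follows. This is an illustration under stated assumptions, not the platform's allocator: the `Server` class, the `allocate` function, the server names, and the best-fit policy for card requests are all hypothetical, and resource groups are left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    kind: str   # accelerator type, e.g. "GPU", "TPU", "DPU"
    total: int  # cards installed
    free: int   # cards currently unallocated


def allocate(pool, kind, mode, cards=0):
    """Return (server_name, cards_granted), or None if nothing fits.

    mode "server": rent a whole, fully idle server of the given kind.
    mode "card":   take `cards` cards from the server with the fewest free
                   cards that still fits (best fit, an assumed policy), so
                   idle servers stay available for whole-server rental.
    """
    candidates = [s for s in pool if s.kind == kind]
    if mode == "server":
        fits = [s for s in candidates if s.free == s.total]
    elif mode == "card":
        fits = [s for s in candidates if s.free >= cards > 0]
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not fits:
        return None
    chosen = min(fits, key=lambda s: s.free)
    granted = chosen.total if mode == "server" else cards
    chosen.free -= granted
    return chosen.name, granted


# Hypothetical pool with two GPU servers and one TPU server.
pool = [
    Server("gpu-01", "GPU", total=8, free=8),
    Server("gpu-02", "GPU", total=8, free=3),
    Server("tpu-01", "TPU", total=4, free=4),
]
card_grant = allocate(pool, "GPU", "card", cards=2)
server_grant = allocate(pool, "GPU", "server")
too_large = allocate(pool, "TPU", "card", cards=8)
print(card_grant, server_grant, too_large)
```

In this sketch the card request lands on the partially used `gpu-02`, which leaves `gpu-01` fully idle for the whole-server request that follows.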
|
|
Training and Inference Integration
The platform provides end-to-end AI services covering data labeling, dataset management, algorithm development, model training, model optimization, model management, and inference deployment. It ships with a variety of commonly used GPU libraries and toolkits and supports mainstream training frameworks such as TensorFlow, PyTorch, and PaddlePaddle.
|
|
AI Repository
The platform provides an image repository, an algorithm repository, and a data sample repository for storing the data and code that users need for training and inference.
|