Concurrent model execution. Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs.
Dynamic batching. For models that support batching, Triton implements multiple scheduling and batching algorithms that combine individual inference requests to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference (see the client-side sketch after this list).
Extensible backends. In addition to deep-learning frameworks, Triton provides a backend API that allows it to be extended with any model execution logic implemented in Python or C++, while still benefiting from the CPU and GPU support, concurrent execution, dynamic batching, and other features Triton provides (a minimal Python backend sketch follows this list).
Model pipelines. A Triton ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble triggers execution of the entire pipeline.
HTTP/REST and gRPC inference protocols based on the community-developed KFServing protocol (a raw HTTP request is sketched after this list).
A C API allows Triton to be linked directly into your application for edge and other in-process use cases.
Metrics indicating GPU utilization, server throughput, and server latency. The metrics are provided in Prometheus data format.
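Concurrent model execution and dynamic batching happen entirely on the server, so from the client they look like ordinary requests. The sketch below sends many single-sample requests in parallel through the Python `tritonclient` HTTP client; the model name `my_model`, the tensor names `INPUT0`/`OUTPUT0`, and the input shape and datatype are placeholder assumptions, not part of Triton itself.

```python
# Client-side sketch: many single-sample requests sent in parallel.
# The server may combine them into batches (dynamic batching) and run model
# instances concurrently; nothing special is required from the client.
# Placeholders to adapt: model name "my_model", tensor names INPUT0/OUTPUT0,
# input shape, datatype.  Requires: pip install "tritonclient[http]" numpy
import numpy as np
import tritonclient.http as httpclient

# concurrency > 1 lets the HTTP client keep several requests in flight.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

def send_one():
    data = np.random.rand(1, 16).astype(np.float32)   # one sample per request
    inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return client.async_infer(model_name="my_model", inputs=[inp])

pending = [send_one() for _ in range(32)]
results = [req.get_result() for req in pending]       # block until all finish
print(results[0].as_numpy("OUTPUT0").shape)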
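For the Python route of the extensible-backend item, a model is implemented as a `model.py` that Triton's Python backend loads. The identity-style stub below is a minimal sketch under the assumption of `INPUT0`/`OUTPUT0` tensor names; those names must match the model's `config.pbtxt`.

```python
# model.py for Triton's Python backend: a minimal identity-style sketch.
# triton_python_backend_utils is supplied by the server at runtime (not a
# pip package).  The tensor names INPUT0/OUTPUT0 are placeholders and must
# match the model's config.pbtxt.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args is a dict with entries such as model_name and model_config.
        self.model_name = args["model_name"]

    def execute(self, requests):
        # Triton may hand over several requests per call (e.g. a dynamic
        # batch); return exactly one response per request, in order.
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

    def finalize(self):
        # Release any per-model resources acquired in initialize().
        pass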
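The HTTP/REST protocol follows the KFServing v2 request/response layout, so a plain JSON POST is enough to run inference. The sketch below assumes a model named `my_model`, an `INPUT0` tensor of shape [1, 16], and Triton's default HTTP port 8000.

```python
# Raw HTTP/REST inference request following the KFServing v2 layout.
# Assumptions: model name "my_model", tensor name INPUT0, shape/datatype,
# and Triton's default HTTP port 8000.  Requires: pip install requests
import requests

payload = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 16],
            "datatype": "FP32",
            "data": [0.0] * 16,   # tensor contents, flattened row-major
        }
    ]
}
resp = requests.post("http://localhost:8000/v2/models/my_model/infer", json=payload)
resp.raise_for_status()
for out in resp.json()["outputs"]:
    print(out["name"], out["datatype"], out["shape"])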
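Because the metrics are exposed in Prometheus text format, they can be scraped with any HTTP client. The sketch below assumes Triton's default metrics port 8002 and uses example metric-name prefixes; check both against your server's actual `/metrics` output.

```python
# Scrape Triton's Prometheus metrics endpoint and print a few series.
# Assumptions: default metrics port 8002 and example metric-name prefixes;
# verify them against your server's output.  Requires: pip install requests
import requests

text = requests.get("http://localhost:8002/metrics").text
for line in text.splitlines():
    if line.startswith("#"):        # skip '# HELP' / '# TYPE' comment lines
        continue
    if line.startswith(("nv_gpu_utilization", "nv_inference_request_success")):
        print(line)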