Triton Inference Server
Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports an HTTP/REST and GRPC protocol that allows remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton is available as a shared library with a C API that allows the full functionality of Triton to be included directly in an application.
Concurrent model execution. Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs.
Dynamic batching. For models that support batching, Triton implements multiple scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.
Extensible backends. In addition to deep-learning frameworks, Triton provides a backend API that allows Triton to be extended with any model execution logic implemented in Python or C++, while still benefiting from the CPU and GPU support, concurrent execution, dynamic batching and other features provided by Triton.
Model pipelines. Triton ensembles represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.
HTTP/REST and GRPC inference protocols based on the community developed KFServing protocol.
A C API allows Triton to be linked directly into your application for edge and other in-process use cases.
Metrics indicating GPU utilization, server throughput, and server latency. The metrics are provided in Prometheus data format.