Triton Inference Server 

Model serving and monitoring

Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports HTTP/REST and GRPC protocols that allow remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton is available as a shared library with a C API that allows the full functionality of Triton to be included directly in an application.
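
To make the remote request path concrete, here is a minimal Python sketch using the tritonclient HTTP package; the model name "my_model", the tensor names INPUT0/OUTPUT0, the shape, and the default port are placeholder assumptions rather than details taken from this page.

import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server exposing the HTTP/REST endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe one FP32 input tensor, attach data, and request one output by name.
inputs = [httpclient.InferInput("INPUT0", [1, 4], "FP32")]
inputs[0].set_data_from_numpy(np.random.rand(1, 4).astype(np.float32))
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

# Ask the server to run inference on the named model and read back the result.
result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0"))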

Features

Multiple deep-learning frameworks. Triton can manage any number and mix of models (limited by system disk and memory resources). Triton supports TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript and OpenVINO model formats. Both TensorFlow 1.x and TensorFlow 2.x are supported. Triton also supports TensorFlow-TensorRT and ONNX-TensorRT integrated models.
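
As a sketch of how such a mix is laid out, the following shows one plausible model repository structure and a minimal configuration for an ONNX model; the model name, tensor names, and dimensions are illustrative assumptions.

models/
  my_onnx_model/
    config.pbtxt
    1/
      model.onnx

config.pbtxt:

name: "my_onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  { name: "INPUT0" data_type: TYPE_FP32 dims: [ 3, 224, 224 ] }
]
output [
  { name: "OUTPUT0" data_type: TYPE_FP32 dims: [ 1000 ] }
]

The server is then started against the repository root, for example with tritonserver --model-repository=/path/to/models.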

Concurrent model execution. Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs.
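
Concurrency is set per model in its configuration. The snippet below is a sketch that would place two execution instances of a model on each available GPU; the count and kind values are illustrative.

instance_group [
  { count: 2 kind: KIND_GPU }
]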

Dynamic batching. For models that support batching, Triton implements multiple scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.
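
Dynamic batching is also enabled per model in its configuration. A minimal sketch, with illustrative values for the preferred batch sizes and the maximum queueing delay:

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}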

Extensible backends. In addition to deep-learning frameworks, Triton provides a backend API that allows Triton to be extended with any model execution logic implemented in Python or C++, while still benefiting from the CPU and GPU support, concurrent execution, dynamic batching and other features provided by Triton.
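
A minimal sketch of a Python backend model, conventionally a model.py file in the model's version directory alongside a configuration that sets backend: "python"; the tensor names INPUT0/OUTPUT0 and the doubling logic are assumptions for illustration.

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Toy Python backend that returns each input tensor multiplied by two."""

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the named input tensor from the request as a numpy array.
            data = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()

            # Build the output tensor and wrap it in a response for this request.
            out_tensor = pb_utils.Tensor("OUTPUT0", data * 2)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses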

Model pipelines. A Triton ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.
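
A sketch of an ensemble configuration that chains a preprocessing model into a classifier; all model and tensor names here are hypothetical placeholders.

name: "image_pipeline"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_IMAGE" data_type: TYPE_UINT8 dims: [ -1 ] } ]
output [ { name: "CLASS_PROB" data_type: TYPE_FP32 dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT" value: "RAW_IMAGE" }
      output_map { key: "OUTPUT" value: "preprocessed_image" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "INPUT" value: "preprocessed_image" }
      output_map { key: "OUTPUT" value: "CLASS_PROB" }
    }
  ]
}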

HTTP/REST and GRPC inference protocols based on the community-developed KFServing protocol.
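
For a feel of the wire format, this sketch posts a v2-style JSON request to the HTTP inference endpoint with Python's requests library; the model name, tensor names, shape, and port are assumed placeholders.

import requests

# Each input carries its name, shape, datatype, and flattened data.
payload = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 4], "datatype": "FP32",
         "data": [0.1, 0.2, 0.3, 0.4]}
    ],
    "outputs": [{"name": "OUTPUT0"}],
}

resp = requests.post("http://localhost:8000/v2/models/my_model/infer", json=payload)
print(resp.json())  # the response JSON lists the returned outputs with their data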

A C API allows Triton to be linked directly into your application for edge and other in-process use cases.

Metrics indicating GPU utilization, server throughput, and server latency. The metrics are provided in Prometheus data format.
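
A small sketch of consuming those metrics: by default Triton serves them on a separate HTTP port at /metrics, and the metric names filtered below are examples that may vary by version.

import requests

# Fetch the Prometheus-format metrics text and print a couple of series of interest.
metrics = requests.get("http://localhost:8002/metrics").text
for line in metrics.splitlines():
    if line.startswith(("nv_gpu_utilization", "nv_inference_request_success")):
        print(line)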

Official website

Tutorial and documentation
