Adding OpenTelemetry Batch Span Processor (#6842)

Co-authored-by: Theo Clark <theoclark101@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
3 people authored Feb 1, 2024
1 parent f345bbb commit 8f98789
Showing 8 changed files with 493 additions and 72 deletions.
82 changes: 81 additions & 1 deletion docs/user_guide/trace.md
@@ -1,5 +1,5 @@
<!--
-# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -456,6 +456,46 @@ flag as follows:
$ tritonserver --trace-config mode=opentelemetry \
--trace-config opentelemetry,url=<endpoint> ...
```

Triton's OpenTelemetry trace mode uses the
[Batch Span Processor](https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#batch-span-processor),
which batches ended spans and sends them in bulk. Batching helps
with data compression and reduces the number of outgoing connections
required to transmit the data. The processor supports both size- and
time-based batching. Size-based batching is controlled by two parameters,
`bsp_max_export_batch_size` and `bsp_max_queue_size`, while time-based batching
is controlled by `bsp_schedule_delay`. Collected spans are exported when
the batch size reaches `bsp_max_export_batch_size`, or when the delay since the
last export reaches `bsp_schedule_delay`, whichever comes first. Additionally,
make sure that `bsp_max_export_batch_size` is no greater than
`bsp_max_queue_size`; otherwise, excess spans will be dropped and trace data
will be lost.
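
For example, a minimal sketch of tuning these parameters at startup; the
endpoint and the values shown are placeholders, not recommendations:

```
$ tritonserver --trace-config mode=opentelemetry \
    --trace-config opentelemetry,url=<endpoint> \
    --trace-config opentelemetry,bsp_max_queue_size=4096 \
    --trace-config opentelemetry,bsp_max_export_batch_size=512 \
    --trace-config opentelemetry,bsp_schedule_delay=2000 ...
```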

Default values for the Batch Span Processor parameters are provided in
[`OpenTelemetry trace APIs settings`](#opentelemetry-trace-apis-settings).
As a general recommendation, make sure that `bsp_max_queue_size` is large enough
to hold all collected spans, and that `bsp_schedule_delay` does not cause overly
frequent exports, which can affect Triton Server's latency. A minimal Triton
trace consists of three spans: a top-level span, a model span, and a compute
span.

* __Top-level span__: The top-level span records the timestamps for when the
request was received by Triton and when the response was sent. Any Triton
trace contains exactly one top-level span.
* __Model span__: Model spans record when the request for a model was started,
when it was put in a queue, and when it ended.
A minimal Triton trace contains one model span.
* __Compute span__: Compute spans record compute timestamps. A minimal
Triton trace contains one compute span.

The total number of spans depends on the complexity of your model.
As a general rule, any base model (a single model that performs computations)
produces one model span and one compute span. For ensembles, every composing
model produces a model span and a compute span, in addition to one model span
for the ensemble itself; a worked example follows this paragraph.
[BLS](#tracing-for-bls-models) models produce the same number of model and
compute spans as the total number of models involved in the BLS request,
including the main BLS model.
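
As a minimal sketch of this counting rule (assuming the rule above; the model
count is illustrative only), an ensemble with two composing base models yields:

```
# Hypothetical example: expected span count for an ensemble
# with N composing base models.
N=2
TOP_LEVEL=1              # one top-level span per trace
ENSEMBLE_MODEL=1         # one model span for the ensemble itself
PER_MODEL=$(( N * 2 ))   # one model span + one compute span per base model
echo $(( TOP_LEVEL + ENSEMBLE_MODEL + PER_MODEL ))   # prints 6
```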


### Differences in trace contents from Triton's trace [output](#json-trace-output)

OpenTelemetry APIs produce [spans](https://opentelemetry.io/docs/concepts/observability-primer/#spans)
@@ -509,6 +549,46 @@ The following table shows available OpenTelemetry trace APIs settings for
environment variable.
</td>
</tr>
<tr>
<td><a href="https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor">
Batch Span Processor</a>
</td>
<td></td><td></td>
</tr>
<tr>
<td><code>bsp_max_queue_size</code></td>
<td align="center">2048</td>
<td>
Maximum queue size; spans collected beyond this limit are dropped. <br/>
This setting can also be specified through <br/>
<a href="https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#batch-span-processor">
OTEL_BSP_MAX_QUEUE_SIZE</a>
environment variable.
</td>
</tr>
<tr>
<td><code>bsp_schedule_delay</code></td>
<td align="center">5000</td>
<td>
Delay interval (in milliseconds) between two consecutive exports. <br/>
This setting can also be specified through <br/>
<a href="https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#batch-span-processor">
OTEL_BSP_SCHEDULE_DELAY</a>
environment variable.
</td>
</tr>
<tr>
<td><code>bsp_max_export_batch_size</code></td>
<td align="center">512</td>
<td>
Maximum batch size. Must be less than or equal to
<code>bsp_max_queue_size</code>.<br/>
This setting can also be specified through <br/>
<a href="https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#batch-span-processor">
OTEL_BSP_MAX_EXPORT_BATCH_SIZE</a>
environment variable.
</td>
</tr>
</tbody>
</table>
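
The same Batch Span Processor settings can also be supplied through the
OpenTelemetry environment variables linked in the table above. For example
(the values shown are illustrative):

```
$ export OTEL_BSP_MAX_QUEUE_SIZE=4096
$ export OTEL_BSP_SCHEDULE_DELAY=2000
$ export OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
$ tritonserver --trace-config mode=opentelemetry \
    --trace-config opentelemetry,url=<endpoint> ...
```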

98 changes: 96 additions & 2 deletions qa/L0_cmdline_trace/test.sh
@@ -1,5 +1,5 @@
#!/bin/bash
-# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -25,6 +25,19 @@
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# ============================= Helpers =======================================
function assert_server_startup_failed() {
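    # run_server (from the shared QA test utilities) is expected to leave
    # SERVER_PID=0 when the server fails to start, so a non-zero PID here
    # means the server came up when it should not have.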
if [ "$SERVER_PID" != "0" ]; then
echo -e "\n***\n***Fail: Server start should have failed $SERVER\n***"
cat $SERVER_LOG
set -e
kill $SERVER_PID
wait $SERVER_PID
set +e
exit 1
fi
}

TRACE_SUMMARY=../common/trace_summary.py
CLIENT_SCRIPT=trace_client.py

@@ -618,11 +631,92 @@ set -e
kill $SERVER_PID
wait $SERVER_PID

set +e

################################################################################
# The following set of tests checks that tritonserver gracefully handles #
# bad OpenTelemetry BatchSpanProcessor parameters, provided through #
# environment variables, or tritonserver's options. #
################################################################################
export OTEL_BSP_MAX_QUEUE_SIZE="bad_value"

SERVER_ARGS="--trace-config mode=opentelemetry --model-repository=$MODELSDIR"
SERVER_LOG="./inference_server_trace_config_flag.log"
run_server
assert_server_startup_failed

if [ `grep -c "Bad option: \"OTEL_BSP_MAX_QUEUE_SIZE\"" $SERVER_LOG` != "1" ]; then
cat $SERVER_LOG
echo -e "\n***\n*** Test Failed\n***"
RET=1
fi

unset OTEL_BSP_MAX_QUEUE_SIZE

export OTEL_BSP_SCHEDULE_DELAY="bad_value"
run_server
assert_server_startup_failed

if [ `grep -c "Bad option: \"OTEL_BSP_SCHEDULE_DELAY\"" $SERVER_LOG` != "1" ]; then
cat $SERVER_LOG
echo -e "\n***\n*** Test Failed\n***"
RET=1
fi

unset OTEL_BSP_SCHEDULE_DELAY

export OTEL_BSP_MAX_EXPORT_BATCH_SIZE="bad_value"
run_server
assert_server_startup_failed

if [ `grep -c "Bad option: \"OTEL_BSP_MAX_EXPORT_BATCH_SIZE\"" $SERVER_LOG` != "1" ]; then
cat $SERVER_LOG
echo -e "\n***\n*** Test Failed\n***"
RET=1
fi

unset OTEL_BSP_MAX_EXPORT_BATCH_SIZE

SERVER_ARGS="--model-repository=$MODELSDIR --trace-config mode=opentelemetry \
--trace-config opentelemetry,bsp_max_queue_size=bad_value"
SERVER_LOG="./inference_server_trace_config_flag.log"
run_server
assert_server_startup_failed

if [ `grep -c "Bad option: \"--trace-config opentelemetry,bsp_max_queue_size\"" $SERVER_LOG` != "1" ]; then
cat $SERVER_LOG
echo -e "\n***\n*** Test Failed\n***"
RET=1
fi

SERVER_ARGS="--model-repository=$MODELSDIR --trace-config mode=opentelemetry \
--trace-config opentelemetry,bsp_schedule_delay=bad_value"
SERVER_LOG="./inference_server_trace_config_flag.log"
run_server
assert_server_startup_failed

if [ `grep -c "Bad option: \"--trace-config opentelemetry,bsp_schedule_delay\"" $SERVER_LOG` != "1" ]; then
cat $SERVER_LOG
echo -e "\n***\n*** Test Failed\n***"
RET=1
fi

SERVER_ARGS="--model-repository=$MODELSDIR --trace-config mode=opentelemetry \
--trace-config opentelemetry,bsp_max_export_batch_size=bad_value"
SERVER_LOG="./inference_server_trace_config_flag.log"
run_server
assert_server_startup_failed

if [ `grep -c "Bad option: \"--trace-config opentelemetry,bsp_max_export_batch_size\"" $SERVER_LOG` != "1" ]; then
cat $SERVER_LOG
echo -e "\n***\n*** Test Failed\n***"
RET=1
fi

if [ $RET -eq 0 ]; then
echo -e "\n***\n*** Test Passed\n***"
else
echo -e "\n***\n*** Test FAILED\n***"
fi


exit $RET
29 changes: 17 additions & 12 deletions qa/L0_trace/opentelemetry_unittest.py
@@ -1,4 +1,4 @@
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -250,9 +250,9 @@ def _verify_contents(self, spans, expected_counts):
span_names = []
for span in spans:
# Check that collected spans have proper events recorded
-            span_name = span[0]["name"]
+            span_name = span["name"]
span_names.append(span_name)
-            span_events = span[0]["events"]
+            span_events = span["events"]
event_names_only = [event["name"] for event in span_events]
self._check_events(span_name, event_names_only)

@@ -283,13 +283,13 @@ def _verify_nesting(self, spans, expected_parent_span_dict):
"""
seen_spans = {}
for span in spans:
-            cur_span = span[0]["spanId"]
-            seen_spans[cur_span] = span[0]["name"]
+            cur_span = span["spanId"]
+            seen_spans[cur_span] = span["name"]

parent_child_dict = {}
for span in spans:
-            cur_parent = span[0]["parentSpanId"]
-            cur_span = span[0]["name"]
+            cur_parent = span["parentSpanId"]
+            cur_span = span["name"]
if cur_parent in seen_spans.keys():
parent_name = seen_spans[cur_parent]
if parent_name not in parent_child_dict:
@@ -377,16 +377,21 @@ def _test_trace(
"""
time.sleep(COLLECTOR_TIMEOUT)
traces = self._parse_trace_log(self.filename)
-        self.assertEqual(len(traces), 1, "Unexpected number of traces collected")
+        expected_traces_number = 1
+        self.assertEqual(
+            len(traces),
+            expected_traces_number,
+            "Unexpected number of traces collected. Expected {}, but got {}".format(
+                expected_traces_number, len(traces)
+            ),
+        )
self._test_resource_attributes(
traces[0]["resourceSpans"][0]["resource"]["attributes"]
)

-        parsed_spans = [
-            entry["scopeSpans"][0]["spans"] for entry in traces[0]["resourceSpans"]
-        ]
+        parsed_spans = traces[0]["resourceSpans"][0]["scopeSpans"][0]["spans"]
root_span = [
-            entry[0] for entry in parsed_spans if entry[0]["name"] == "InferRequest"
+            entry for entry in parsed_spans if entry["name"] == "InferRequest"
][0]
self.assertEqual(len(parsed_spans), expected_number_of_spans)

