collectd/spark

Collects metrics about a Spark cluster using the collectd Spark Python plugin. Also see https://github.com/signalfx/integrations/tree/master/collectd-spark.

You have to specify distinct monitor configurations and discovery rules for master and worker processes. For the master configuration, set isMaster to true.
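A minimal sketch of what a pair of master and worker configurations could look like on a Standalone cluster. The container image names `spark-master` and `spark-worker` are hypothetical, and the ports 8080 and 8081 are assumed to be the master and worker web UI ports in your deployment; adjust both to match your environment. With a discovery rule, the host and port are expected to come from the discovered endpoint rather than from the monitor configuration.

```yaml
monitors:
  # Master process: isMaster must be set to true.
  - type: collectd/spark
    # Hypothetical image name and assumed port; replace with your own.
    discoveryRule: container_image =~ "spark-master" && port == 8080
    clusterType: Standalone
    isMaster: true
    collectApplicationMetrics: true
  # Worker process: isMaster keeps its default of false.
  - type: collectd/spark
    # Hypothetical image name and assumed port; replace with your own.
    discoveryRule: container_image =~ "spark-worker" && port == 8081
    clusterType: Standalone
```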

We only support HTTP endpoints for now.

When running Spark on Apache Hadoop / Yarn, this integration is only capable of reporting application metrics from the master node. Please use the collectd/hadoop monitor to report on the health of the cluster.

An example configuration for monitoring applications on Yarn:

```yaml
monitors:
  - type: collectd/spark
    host: 000.000.000.000
    port: 8088
    clusterType: Yarn
    isMaster: true
    collectApplicationMetrics: true
```
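
Because cluster metrics are not collected on Yarn, the Spark monitor would typically sit next to a collectd/hadoop monitor. The sketch below assumes that collectd/hadoop accepts `host` and `port` options and that it points at the same endpoint as the Spark monitor; neither is confirmed here, so check that monitor's own documentation before using it.

```yaml
monitors:
  # Application metrics from the master node.
  - type: collectd/spark
    host: 000.000.000.000
    port: 8088
    clusterType: Yarn
    isMaster: true
    collectApplicationMetrics: true
  # Cluster health. The option names below are assumed, not taken from this page.
  - type: collectd/hadoop
    host: 000.000.000.000
    port: 8088
```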

Monitor Type: collectd/spark

Monitor Source Code

Accepts Endpoints: Yes

Multiple Instances Allowed: Yes

Configuration

| Config option | Required | Type | Description |
| --- | --- | --- | --- |
| host | yes | string | |
| port | yes | integer | |
| isMaster | no | bool | Set to true when monitoring a master Spark node (default: false) |
| clusterType | yes | string | Must be one of Standalone, Mesos, or Yarn. Cluster metrics are not collected on Yarn; use the collectd/hadoop monitor to gain insight into your cluster's health. |
| collectApplicationMetrics | no | bool | (default: false) |
| enhancedMetrics | no | bool | (default: false) |

Metrics

The following table lists the metrics available for this monitor. Metrics that are marked as Included are standard metrics and are monitored by default.

| Name | Type | Included | Description |
| --- | --- | --- | --- |
| counter.HiveExternalCatalog.counter.HiveClientCalls | counter | | Total number of client calls sent to Hive for query processing |
| counter.HiveExternalCatalog.fileCacheHits | counter | | Total number of file-level cache hits |
| counter.HiveExternalCatalog.filesDiscovered | counter | | Total number of files discovered |
| counter.HiveExternalCatalog.parallelListingJobCount | counter | | Total number of Hive-specific jobs running in parallel |
| counter.HiveExternalCatalog.partitionsFetched | counter | | Total number of partitions fetched |
| counter.spark.driver.completed_tasks | counter | | Total number of completed tasks in driver mapped to a particular application |
| counter.spark.driver.disk_used | counter | ✔ | Amount of disk used by driver mapped to a particular application |
| counter.spark.driver.failed_tasks | counter | | Total number of failed tasks in driver mapped to a particular application |
| counter.spark.driver.memory_used | counter | ✔ | Amount of memory used by driver mapped to a particular application |
| counter.spark.driver.total_duration | counter | | Fraction of time spent by driver mapped to a particular application |
| counter.spark.driver.total_input_bytes | counter | ✔ | Number of input bytes in driver mapped to a particular application |
| counter.spark.driver.total_shuffle_read | counter | ✔ | Size read during a shuffle in driver mapped to a particular application |
| counter.spark.driver.total_shuffle_write | counter | ✔ | Size written during a shuffle in driver mapped to a particular application |
| counter.spark.driver.total_tasks | counter | ✔ | Total number of tasks in driver mapped to a particular application |
| counter.spark.executor.completed_tasks | counter | | Completed tasks across executors working for a particular application |
| counter.spark.executor.disk_used | counter | ✔ | Amount of disk used across executors working for a particular application |
| counter.spark.executor.failed_tasks | counter | | Failed tasks across executors working for a particular application |
| counter.spark.executor.memory_used | counter | ✔ | Amount of memory used across executors working for a particular application |
| counter.spark.executor.total_duration | counter | | Fraction of time spent across executors working for a particular application |
| counter.spark.executor.total_input_bytes | counter | ✔ | Number of input bytes across executors working for a particular application |
| counter.spark.executor.total_shuffle_read | counter | ✔ | Size read during a shuffle in a particular application's executors |
| counter.spark.executor.total_shuffle_write | counter | ✔ | Size written during a shuffle in a particular application's executors |
| counter.spark.executor.total_tasks | counter | | Total tasks across executors working for a particular application |
| counter.spark.streaming.num_processed_records | counter | ✔ | Number of processed records in a streaming application |
| counter.spark.streaming.num_received_records | counter | ✔ | Number of received records in a streaming application |
| counter.spark.streaming.num_total_completed_batches | counter | ✔ | Number of batches completed in a streaming application |
| gauge.jvm.MarkSweepCompact.count | gauge | | Garbage collection count |
| gauge.jvm.MarkSweepCompact.time | gauge | | Garbage collection time |
| gauge.jvm.heap.committed | gauge | ✔ | Amount of committed heap memory (in MB) |
| gauge.jvm.heap.used | gauge | ✔ | Amount of used heap memory (in MB) |
| gauge.jvm.non-heap.committed | gauge | ✔ | Amount of committed non-heap memory (in MB) |
| gauge.jvm.non-heap.used | gauge | ✔ | Amount of used non-heap memory (in MB) |
| gauge.jvm.pools.Code-Cache.committed | gauge | | Amount of memory committed for compilation and storage of native code |
| gauge.jvm.pools.Code-Cache.used | gauge | | Amount of memory used to compile and store native code |
| gauge.jvm.pools.Compressed-Class-Space.committed | gauge | | Amount of memory committed for compressing a class object |
| gauge.jvm.pools.Compressed-Class-Space.used | gauge | | Amount of memory used to compress a class object |
| gauge.jvm.pools.Eden-Space.committed | gauge | | Amount of memory committed for the initial allocation of objects |
| gauge.jvm.pools.Eden-Space.used | gauge | | Amount of memory used for the initial allocation of objects |
| gauge.jvm.pools.Metaspace.committed | gauge | | Amount of memory committed for storing classes and classloaders |
| gauge.jvm.pools.Metaspace.used | gauge | | Amount of memory used to store classes and classloaders |
| gauge.jvm.pools.Survivor-Space.committed | gauge | | Amount of memory committed specifically for objects that have survived GC of the Eden Space |
| gauge.jvm.pools.Survivor-Space.used | gauge | | Amount of memory used for objects that have survived GC of the Eden Space |
| gauge.jvm.pools.Tenured-Gen.committed | gauge | | Amount of memory committed to store objects that have lived in the survivor space for a given period of time |
| gauge.jvm.pools.Tenured-Gen.used | gauge | | Amount of memory used for objects that have lived in the survivor space for a given period of time |
| gauge.jvm.total.committed | gauge | ✔ | Amount of committed JVM memory (in MB) |
| gauge.jvm.total.used | gauge | ✔ | Amount of used JVM memory (in MB) |
| gauge.master.aliveWorkers | gauge | ✔ | Total functioning workers |
| gauge.master.apps | gauge | ✔ | Total number of active applications in the Spark cluster |
| gauge.master.waitingApps | gauge | ✔ | Total number of waiting applications in the Spark cluster |
| gauge.master.workers | gauge | ✔ | Total number of workers in the Spark cluster |
| gauge.spark.driver.active_tasks | gauge | | Total number of active tasks in driver mapped to a particular application |
| gauge.spark.driver.max_memory | gauge | ✔ | Maximum memory used by driver mapped to a particular application |
| gauge.spark.driver.rdd_blocks | gauge | | Number of RDD blocks in the driver mapped to a particular application |
| gauge.spark.executor.active_tasks | gauge | | Total number of active tasks across all executors working for a particular application |
| gauge.spark.executor.count | gauge | ✔ | Total number of executors running for an active application in the Spark cluster |
| gauge.spark.executor.max_memory | gauge | ✔ | Max memory across all executors working for a particular application |
| gauge.spark.executor.rdd_blocks | gauge | | Number of RDD blocks across all executors working for a particular application |
| gauge.spark.job.num_active_stages | gauge | ✔ | Total number of active stages for an active application in the Spark cluster |
| gauge.spark.job.num_active_tasks | gauge | ✔ | Total number of active tasks for an active application in the Spark cluster |
| gauge.spark.job.num_completed_stages | gauge | ✔ | Total number of completed stages for an active application in the Spark cluster |
| gauge.spark.job.num_completed_tasks | gauge | ✔ | Total number of completed tasks for an active application in the Spark cluster |
| gauge.spark.job.num_failed_stages | gauge | ✔ | Total number of failed stages for an active application in the Spark cluster |
| gauge.spark.job.num_failed_tasks | gauge | ✔ | Total number of failed tasks for an active application in the Spark cluster |
| gauge.spark.job.num_skipped_stages | gauge | ✔ | Total number of skipped stages for an active application in the Spark cluster |
| gauge.spark.job.num_skipped_tasks | gauge | ✔ | Total number of skipped tasks for an active application in the Spark cluster |
| gauge.spark.job.num_tasks | gauge | ✔ | Total number of tasks for an active application in the Spark cluster |
| gauge.spark.num_active_stages | gauge | ✔ | Total number of active stages for an active application in the Spark cluster |
| gauge.spark.num_running_jobs | gauge | ✔ | Total number of running jobs for an active application in the Spark cluster |
| gauge.spark.stage.disk_bytes_spilled | gauge | ✔ | Actual size written to disk for an active application in the Spark cluster |
| gauge.spark.stage.executor_run_time | gauge | ✔ | Fraction of time spent by (and averaged across) executors for a particular application |
| gauge.spark.stage.input_bytes | gauge | ✔ | Input size for a particular application |
| gauge.spark.stage.input_records | gauge | ✔ | Input records received for a particular application |
| gauge.spark.stage.memory_bytes_spilled | gauge | ✔ | Size spilled to disk from memory for an active application in the Spark cluster |
| gauge.spark.stage.output_bytes | gauge | ✔ | Output size for a particular application |
| gauge.spark.stage.output_records | gauge | ✔ | Output records written for a particular application |
| gauge.spark.stage.shuffle_read_bytes | gauge | | Read size during shuffle phase for a particular application |
| gauge.spark.stage.shuffle_read_records | gauge | | Number of records read during shuffle phase for a particular application |
| gauge.spark.stage.shuffle_write_bytes | gauge | | Size written during shuffle phase for a particular application |
| gauge.spark.stage.shuffle_write_records | gauge | | Number of records written during shuffle phase for a particular application |
| gauge.spark.streaming.avg_input_rate | gauge | ✔ | Average input rate of records across retained batches in a streaming application |
| gauge.spark.streaming.avg_processing_time | gauge | ✔ | Average processing time in a streaming application |
| gauge.spark.streaming.avg_scheduling_delay | gauge | ✔ | Average scheduling delay in a streaming application |
| gauge.spark.streaming.avg_total_delay | gauge | ✔ | Average total delay in a streaming application |
| gauge.spark.streaming.num_active_batches | gauge | ✔ | Number of active batches in a streaming application |
| gauge.spark.streaming.num_inactive_receivers | gauge | ✔ | Number of inactive receivers in a streaming application |
| gauge.worker.coresFree | gauge | ✔ | Total cores free for a particular worker process |
| gauge.worker.coresUsed | gauge | ✔ | Total cores used by a particular worker process |
| gauge.worker.executors | gauge | ✔ | Total number of executors for a particular worker process |
| gauge.worker.memFree_MB | gauge | ✔ | Total memory free for a particular worker process |
| gauge.worker.memUsed_MB | gauge | ✔ | Memory used by a particular worker process |

To specify custom metrics you want to monitor, add a metricsToInclude filter to the agent configuration, as shown in the code snippet below. The snippet lists all available custom metrics. You can copy and paste the snippet into your configuration file, then delete any custom metrics that you do not want sent.

Note that some of the custom metrics require you to set a flag as well as add them to the list. Check the monitor configuration file to see if a flag is required for gathering additional metrics.

```yaml
metricsToInclude:
  - metricNames:
    - counter.HiveExternalCatalog.counter.HiveClientCalls
    - counter.HiveExternalCatalog.fileCacheHits
    - counter.HiveExternalCatalog.filesDiscovered
    - counter.HiveExternalCatalog.parallelListingJobCount
    - counter.HiveExternalCatalog.partitionsFetched
    - counter.spark.driver.completed_tasks
    - counter.spark.driver.failed_tasks
    - counter.spark.driver.total_duration
    - counter.spark.executor.completed_tasks
    - counter.spark.executor.failed_tasks
    - counter.spark.executor.total_duration
    - counter.spark.executor.total_tasks
    - gauge.jvm.MarkSweepCompact.count
    - gauge.jvm.MarkSweepCompact.time
    - gauge.jvm.pools.Code-Cache.committed
    - gauge.jvm.pools.Code-Cache.used
    - gauge.jvm.pools.Compressed-Class-Space.committed
    - gauge.jvm.pools.Compressed-Class-Space.used
    - gauge.jvm.pools.Eden-Space.committed
    - gauge.jvm.pools.Eden-Space.used
    - gauge.jvm.pools.Metaspace.committed
    - gauge.jvm.pools.Metaspace.used
    - gauge.jvm.pools.Survivor-Space.committed
    - gauge.jvm.pools.Survivor-Space.used
    - gauge.jvm.pools.Tenured-Gen.committed
    - gauge.jvm.pools.Tenured-Gen.used
    - gauge.spark.driver.active_tasks
    - gauge.spark.driver.rdd_blocks
    - gauge.spark.executor.active_tasks
    - gauge.spark.executor.rdd_blocks
    - gauge.spark.stage.shuffle_read_bytes
    - gauge.spark.stage.shuffle_read_records
    - gauge.spark.stage.shuffle_write_bytes
    - gauge.spark.stage.shuffle_write_records
    monitorType: collectd/spark
```
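
As a trimmed variant of the snippet above, the sketch below keeps only the driver and executor task counters and pairs the filter with flags on the monitor itself. This page does not say which flag, if any, gates a given custom metric, so treating `enhancedMetrics` and `collectApplicationMetrics` as the relevant flags for these four metrics is an assumption. The host is a placeholder, and port 8080 is assumed to be the master's port in your deployment.

```yaml
monitors:
  - type: collectd/spark
    host: 000.000.000.000        # placeholder address
    port: 8080                   # assumed master port; adjust as needed
    clusterType: Standalone
    isMaster: true
    # Assumption: these flags are what the custom metrics below depend on.
    collectApplicationMetrics: true
    enhancedMetrics: true

metricsToInclude:
  - metricNames:
    - counter.spark.driver.completed_tasks
    - counter.spark.driver.failed_tasks
    - counter.spark.executor.completed_tasks
    - counter.spark.executor.failed_tasks
    monitorType: collectd/spark
```

Keeping the filter scoped with `monitorType` means that only metrics from this monitor are matched by the listed names.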