
deadlock; recursive locking #358


Closed
johnr84 opened this issue Mar 23, 2020 · 9 comments
Labels: bug (Something isn't working), documentation (Docs needed)

Comments

johnr84 commented Mar 23, 2020

What happened:

I am pretty new to Splunk and deployed Splunk Connect for Kubernetes (1.4.0) in the cluster, and I see the error below in the agent that runs on the master node. Kube API server logs are not pushed to the Splunk server.

2020-03-23 08:31:30 +0000 [warn]: #0 dump an error event: error_class=ThreadError error="deadlock; recursive locking" location="/usr/share/gems/gems/fluent-plugin-concat-2.4.0/lib/fluent/plugin/filter_concat.rb:189:in `synchronize'" tag="tail.containers.var.log.containers.kube-apiserver-k8s1m_kube-system_kube-apiserver-f71d1b0e611b1f82d45637a2aaae75b5a0849b966bab165f8fa3078194b55b1a.log" time=2020-03-23 08:31:25.150968430 +0000 record={"log"=>"I0323 08:31:25.150846 1 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io\n", "stream"=>"stderr", "source"=>"/var/log/containers/kube-apiserver-k8s1m_kube-system_kube-apiserver-f71d1b0e611b1f82d45637a2aaae75b5a0849b966bab165f8fa3078194b55b1a.log"}

2020-03-23 08:31:30 +0000 [info]: #0 Timeout flush: tail.containers.var.log.containers.kube-apiserver-k8s1m_kube-system_kube-apiserver-f71d1b0e611b1f82d45637a2aaae75b5a0849b966bab165f8fa3078194b55b1a.log:stderr

What you expected to happen:

API server logs get pushed to the Splunk server.

Anything else we need to know?:

Application logs from other containers are being pushed to the Splunk server.

Environment:

  • Kubernetes version (use kubectl version): 1.15.4
  • Ruby version (use ruby --version): ruby 2.5.5p157
  • OS (e.g. cat /etc/os-release): Ubuntu 18.04.2 LTS
  • Splunk version: 8.0.1
  • SCK version: 1.4.0

matthewmodestino (Collaborator) commented Mar 25, 2020

This is definitely a bug in how we use fluentd labels, imo. Marking it as such.

See workaround with concat dev suggestions here:

fluent-plugins-nursery/fluent-plugin-concat#69 (comment)
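
My read of that thread: the shipped chart renders the concat filters inside the @SPLUNK label, and when the plugin's timeout flush fires it re-emits the buffered record back through the router into that same label, so the filter re-enters its own mutex on the same thread and Ruby raises ThreadError ("deadlock; recursive locking"). Rough sketch of the problematic shape (illustrative only, not the exact rendered config):

<label @SPLUNK>
  <filter tail.containers.**>
    @type concat
    key log
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    # no timeout_label here, so timeout-flushed events come straight back
    # into this same @SPLUNK label and hit this filter's mutex again
  </filter>
  # ...the rest of the @SPLUNK pipeline (metadata filters, output) follows here...
</label>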

matthewmodestino added the bug and documentation labels on Mar 25, 2020

matthewmodestino (Collaborator) commented Mar 28, 2020

https://github.com/fluent-plugins-nursery/fluent-plugin-concat#configuration

timeout_label
timeout_label (string) (optional)
The label name to handle events caused by timeout.

Going to remove all the built-in concat rules; looks like we need to send timeouts to another label?

<label @SPLUNK>
      # = filters for container logs =
      <filter tail.containers.var.log.containers.dns-controller*dns-controller*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>
      <filter tail.containers.var.log.containers.kube-dns*sidecar*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>
      <filter tail.containers.var.log.containers.kube-dns*dnsmasq*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>
      <filter tail.containers.var.log.containers.kube-apiserver*kube-apiserver*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>
      <filter tail.containers.var.log.containers.kube-controller-manager*kube-controller-manager*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>
      <filter tail.containers.var.log.containers.kube-dns-autoscaler*autoscaler*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>
      <filter tail.containers.var.log.containers.kube-proxy*kube-proxy*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>
      <filter tail.containers.var.log.containers.kube-scheduler*kube-scheduler*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>
      <filter tail.containers.var.log.containers.kube-dns*kubedns*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp /^\w[0-1]\d[0-3]\d/
        flush_interval 5s
        separator ""
        use_first_timestamp true
      </filter>

matthewmodestino (Collaborator) commented Mar 29, 2020

I was able to work around this by following the thread I posted above on the concat repo.

I started by removing all the shipped concat filters from the logging and umbrella chart values.yaml (these should be optional anyway), then updated the configmap sources that we need concat for to point to a @CONCAT label (the containers and files sources currently allow you to set concat):

<source>
      @id containers.log
      @type tail
      @label @CONCAT
      tag tail.containers.*
      path {{ .Values.fluentd.path | default "/var/log/containers/*.log" }}
      {{- if .Values.fluentd.exclude_path }}
      exclude_path {{ .Values.fluentd.exclude_path | toJson }}
      {{- end }}
      pos_file /var/log/splunk-fluentd-containers.log.pos
      path_key source
      read_from_head true
      <parse>
      {{- if eq .Values.containers.logFormatType "cri" }}
        @type regexp
        expression /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
        time_format  {{ .Values.containers.logFormat | default "%Y-%m-%dT%H:%M:%S.%N%:z" }}
      {{- else if eq .Values.containers.logFormatType "json" }}
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      {{- end }}
        time_key time
        time_type string
        localtime false
      </parse>
    </source>

Then I moved the concat filter outside the @SPLUNK label in output.conf:

 <label @CONCAT>
     # = filters for container logs =
      {{- range $name, $logDef := .Values.logs }}
      {{- if and $logDef.from.pod $logDef.multiline }}
      <filter tail.containers.var.log.containers.{{ $logDef.from.pod }}*{{ or $logDef.from.container $name }}*.log>
        @type concat
        key log
        timeout_label @SPLUNK
        stream_identity_key stream
        multiline_start_regexp {{ $logDef.multiline.firstline }}
        flush_interval {{ $logDef.multiline.flushInterval | default "2s" }}
        separator "\n"
        use_first_timestamp true
      </filter>
      {{- end }}
      {{- end }}

      <match **>
        @type relabel
        @label @SPLUNK
      </match>
    </label>
    <label @SPLUNK>
      # Enrich log with k8s metadata
      <filter tail.containers.**>
        @type kubernetes_metadata
        annotation_match [ "^splunk\.com" ]
        de_dot false
      </filter>
      <filter tail.containers.**>
        @type record_transformer
        enable_ruby
        <record>
...
...

Because not all logs need the same concat settings (e.g. separator; in the example above my pod needs separator "\n", while some don't), I believe we need to expose more multiline settings in the Helm chart instead of rendering all concat filters with the same settings block.
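
Something like this is the values.yaml shape I have in mind (the separator key is hypothetical; only firstline and flushInterval are rendered today), with output.conf picking it up per filter the same way it already does for firstline and flushInterval:

logs:
  dns-controller:
    from:
      pod: dns-controller
    multiline:
      firstline: /^\w[0-1]\d[0-3]\d/
      flushInterval: 2s
      separator: "\n"    # per-log override; other rules could keep ""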

So, what we need to solve this:

  • Correct the configmap so that source.containers.conf & source.files.conf point to a new label (e.g. @CONCAT) which contains the multiline logic in output.conf
  • Expose more multiline settings in the Helm chart, especially separator, as not all concat rules need the same treatment (see the sketch above)
  • Remove our shipped concat filters, or make them optional

matthewmodestino (Collaborator):

Oh, and you'll notice I set the timeout to 2s. I don't know of a concat situation in normal ops where we need to wait that long for more lines. I actually think it should be 1s, but we should test whether events ever span more than a whole second apart.

sfrancies commented Apr 3, 2020

I got the same warn msg:

[warn]: #0 dump an error event: error_class=ThreadError error="deadlock; recursive locking" location="/usr/share/gems/gems/fluent-plugin-concat-2.4.0/lib/fluent/plugin/filter_concat.rb:189:in `synchronize'" tag="tail.containers.var.log.containers.kube-proxy-w2hcg_kube-system_kube-proxy-f5c846203755b66df23ed167d65b907a59d8a93893bfad14b2da804563d62a57.log" time=2020-04-03 19:12:27.633148152 +0000 record={"log"=>"I0403 19:12:27.633038 1 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s\n", "stream"=>"stderr", "source"=>"/var/log/containers/kube-proxy-w2hcg_kube-system_kube-proxy-f5c846203755b66df23ed167d65b907a59d8a93893bfad14b2da804563d62a57.log"}
2020-04-03 19:12:32 +0000 [info]: #0 Timeout flush: tail.containers.var.log.containers.kube-proxy-w2hcg_kube-system_kube-proxy-f5c846203755b66df23ed167d65b907a59d8a93893bfad14b2da804563d62a57.log:stderr

Will there be a fix for it soon?

szymonpk commented Apr 9, 2020

@matthewmodestino any chance you will make an MR with this fix, since you have figured out the whole issue? We are experiencing the same issue with EKS, and I would like to avoid forking the chart.


matthewmodestino (Collaborator) commented Apr 9, 2020

Yeah, I'll try and get it started later today. Do you think I should make the default concat filters optional? e.g. the ability to toggle them all on or all off?

I think that's what I'll try, because I don't believe they apply in all environments...
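
Roughly what I'm picturing (the flag name is just a placeholder, nothing is decided yet):

# values.yaml
defaultConcatFilters:
  enabled: false

# output.conf template: only render the shipped rules when the toggle is on
{{- if .Values.defaultConcatFilters.enabled }}
<filter tail.containers.var.log.containers.kube-apiserver*kube-apiserver*.log>
  @type concat
  key log
  timeout_label @SPLUNK
  stream_identity_key stream
  multiline_start_regexp /^\w[0-1]\d[0-3]\d/
  flush_interval 5s
  separator ""
  use_first_timestamp true
</filter>
{{- end }}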

szymonpk:

@matthewmodestino if it is possible to make it optional, then sure. Looks like not too many people experience the same issue. I am a fan of Occam's razor: "Entities should not be multiplied without necessity." ;-)

BTW, thanks for looking into this.

This was referenced Apr 10, 2020
matthewmodestino linked a pull request Apr 10, 2020 that will close this issue
matthewmodestino (Collaborator):

@szymonpk PR is in for review. Once the team has a chance to take a look, I will add another to expose the "separator" option... and will look for any others we think we need to make multiline logs pretty.
