Discussion:
[Bro] BRO Logger crashing due to large DNS log files
Ron McClellan
2018-08-19 15:12:17 UTC
Permalink
All,

Having an issue with the bro logger crashing due to large volumes of DNS log traffic, 20-30 GB an hour. This is a completely local configuration, on a system with super-fast flash storage, 64 cores, and 256 GB RAM running Bro 2.5.4. If I disable DNS logging, everything works fine without issue. When I enable it, I get the results below. I thought it might be an issue with gzipping the old logs, so I replaced the standard gzip with pigz, and I can manually compress the 30+ GB files in seconds, so I don't think that is the issue. I also tried pinning dedicated cores to the logger (currently 6 cores), which should be way more than enough. Any thoughts or suggestions?

Thanks,

Ron

current]# ll -h
total 43G
-rw-r--r--. 1 root root 3.2K Aug 18 12:00 capture_loss-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 3.2K Aug 18 12:18 capture_loss-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 2.3M Aug 18 12:00 communication-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 1.4M Aug 18 12:18 communication-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 4.8K Aug 18 12:18 communication.log
-rw-r--r--. 1 root root 19G Aug 18 11:39 dns-18-08-18_10.11.22.log
-rw-r--r--. 1 root root 16G Aug 18 12:26 dns-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 12M Aug 18 12:00 files-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 5.2M Aug 18 12:18 files-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 15K Aug 18 12:00 known_certs-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 15K Aug 18 12:18 known_certs-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 98K Aug 18 12:00 known_hosts-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 24K Aug 18 12:18 known_hosts-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 71K Aug 18 12:00 known_services-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 5.2K Aug 18 12:18 known_services-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 1.6K Aug 18 12:00 notice-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 954 Aug 18 12:18 notice-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 262 Aug 18 12:18 reporter-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 23M Aug 18 12:00 smtp-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 9.2M Aug 18 12:18 smtp-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 1.2M Aug 18 12:00 snmp-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 415K Aug 18 12:18 snmp-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 81K Aug 18 12:00 software-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 8.4K Aug 18 12:18 software-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 30K Aug 18 12:00 ssh-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 13K Aug 18 12:18 ssh-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 217K Aug 18 12:00 ssl-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 78K Aug 18 12:18 ssl-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 37K Aug 18 12:00 stats-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 16K Aug 18 12:18 stats-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 28 Aug 18 12:18 stderr.log
-rw-r--r--. 1 root root 188 Aug 18 10:11 stdout.log
-rw-r--r--. 1 root root 6.8G Aug 18 12:00 weird-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 2.5G Aug 18 12:18 weird-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 178K Aug 18 12:00 x509-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 80K Aug 18 12:18 x509-18-08-18_12.00.00.log

# /usr/local/bro/bin/bro --version
/usr/local/bro/bin/bro version 2.5.4
Azoff, Justin S
2018-08-20 14:30:35 UTC
Permalink
Post by Ron McClellan
All,
Having an issue with the bro logger crashing due to large volumes of DNS log traffic, 20-30GB an hour.
Is it actually crashing? Are you getting a crash report at all? From the filenames you listed it looks more like log rotation is failing.
Post by Ron McClellan
This is completely a local configuration, on a system with super-fast flash storage, 64 cores, 256GB RAM running BRO 2.5.4. If I disable DNS logging, everything works fine without issue. When I enable it, I get the results below. I thought it might be an issue with gzipping the old logs, so I replaced the standard gzip with pigz and I can manually compress the 30+ gig files in seconds, so don’t think that is the issue.
It could be related to the gzipping. The way log rotation works is not great: all log files get compressed at the same time, which can cause some thrashing.

If you set

compresslogs = 0

in broctl.cfg so that broctl does not gzip the logs at all, does the problem go away?
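
For completeness, the change plus the step to push it out would look roughly like this (a sketch; the deploy step is the usual way to have broctl pick up a config change):

# in broctl.cfg:
compresslogs = 0

# then push the updated config out:
/usr/local/bro/bin/broctl deploy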

You could do something like that, and then run a script like:

while true; do
    for f in /usr/local/bro/logs/201*/*.log ; do
        gzip $f
    done
    sleep 60
done

to compress the logs in the background serially.
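
One way to run that loop in the background (a sketch; the script path is just an example, and the nice/ionice wrapping is only a suggestion to keep compression from competing with the logger for CPU and disk):

# save the loop above as e.g. /usr/local/bro/bin/compress-old-logs.sh, then:
chmod +x /usr/local/bro/bin/compress-old-logs.sh
nohup nice -n 19 ionice -c3 /usr/local/bro/bin/compress-old-logs.sh >/dev/null 2>&1 &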

Another thing to keep an eye on is whether your logger is able to keep up with the volume of data. This script is a plugin for munin, but you can run it directly:

#!/usr/bin/env python
import os
import sys
import time

DEFAULT_LOG = "/usr/local/bro/logs/current/dns.log"

def config():
    print """
graph_category network

graph_title Bro log lag
graph_vlabel lag
graph_args --base 1000 --vertical-label seconds --lower-limit 0
graph_info The bro log lag

lag.label lag
lag.info log message lag in seconds
lag.min 0
lag.warning 0:15
lag.critical 0:60
""".strip()

    return 0

def get_latest_time(fn):
    f = open(fn)

    f.seek(-4096, os.SEEK_END)
    end = f.read().splitlines()[1:-1]  # ignore possibly incomplete first and last lines
    times = [line.split()[0] for line in end]
    timestamps = map(float, times)
    latest = max(timestamps)
    return latest

def lag(fn):
    lag = 500
    for x in range(3):
        try:
            latest = get_latest_time(fn)
            now = time.time()
            lag = now - latest
            break
        except (IOError, ValueError):
            # File could be rotating, wait and try again
            time.sleep(5)
    print "lag.value %f" % lag

if __name__ == "__main__":

    filename = os.getenv("BRO_LAG_FILENAME", DEFAULT_LOG)

    if sys.argv[1:] and sys.argv[1] == 'config':
        config()
    else:
        lag(filename)

It will output something like

lag.value 2.919352

A normal value should be about 5; anything under 20 is probably OK. If it's 500 and climbing, that's a problem.
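
For reference, a couple of ways to invoke it (a sketch; the file name bro-log-lag.py is just an example, and BRO_LAG_FILENAME and the 'config' argument are the knobs the script itself reads):

# check the default dns.log:
python bro-log-lag.py

# point it at a different log:
BRO_LAG_FILENAME=/usr/local/bro/logs/current/conn.log python bro-log-lag.py

# print the munin graph configuration:
python bro-log-lag.py config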

Also:
Post by Ron McClellan
-rw-r--r--. 1 root root 6.8G Aug 18 12:00 weird-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 2.5G Aug 18 12:18 weird-18-08-18_12.00.00.log
That's a LOT of weird.log, what's going on there?


Justin Azoff
Chris Herdt
2018-08-20 14:54:15 UTC
Permalink
Hi Ron,

I had a similar issue where both the Bro DNS and weird logs were huge. In
my case, it turned out that our taps were primarily seeing inbound DNS
replies but not all of the outbound DNS requests, which triggered weird log
entries. Something to consider if the majority of your weird logs are
DNS-related entries.
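
If you want to sanity-check the tap feed directly, something along these lines would show whether both directions are present (a sketch; substitute your actual sniffing interface, and expect the query and reply counts to be roughly balanced if you are seeing both sides):

# grab a short DNS sample off the monitored interface
timeout 30 tcpdump -ni <iface> -w /tmp/dns-sample.pcap 'udp port 53'

# count queries vs. replies in the sample
tcpdump -nr /tmp/dns-sample.pcap 'udp dst port 53' 2>/dev/null | wc -l
tcpdump -nr /tmp/dns-sample.pcap 'udp src port 53' 2>/dev/null | wc -l
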
--
Chris Herdt
UIS Systems Administrator
***@umn.edu
Ron McClellan
2018-08-20 15:48:08 UTC
Permalink
Justin,

Thanks, I turned off compression, and for 2+ hours everything has been working well. I kinda had an idea it was related to the compression, but thought the pigz replacement would take care of that; guess not. Appreciate the help. Will let everyone know how it goes over the long term. I think you and Chris hit the nail on the head about the weird logs. I haven't really started tuning much; I wanted to get the system nice and stable first and then start tuning and looking at the weird stuff, which is heavy DNS.

Thanks Again,

Ron



[root@ current]# cat weird.log | bro-cut name|sort|uniq -c|sort -rn
34264380 dns_unmatched_msg
16696030 dns_unmatched_reply
330912 DNS_RR_unknown_type
62288 possible_split_routing
59512 data_before_established
38396 NUL_in_line
21210 inappropriate_FIN
21209 line_terminated_with_single_CR
18978 DNS_RR_length_mismatch
1852 bad_TCP_checksum
1060 dnp3_corrupt_header_checksum
922 truncated_tcp_payload
326 dnp3_header_lacks_magic
230 DNS_truncated_RR_rdlength_lt_len
92 non_ip_packet_in_ethernet
92 above_hole_data_without_any_acks
48 SYN_seq_jump
46 window_recision
46 dns_unmatched_msg_quantity
46 DNS_truncated_ans_too_short
46 DNS_RR_bad_length
46 DNS_Conn_count_too_large
46 ayiya_tunnel_non_ip






Ron McClellan
2018-08-20 18:04:54 UTC
Permalink
Update:

Worked for almost 3 hours, but then started failing again. I even changed the log rotation to every 15 minutes and it still crashes. Any other suggestions? Has anyone ever tried configuring syslog-ng to handle the logging?


Warning: broctl config has changed (run the broctl "deploy" command)
Name Type Host Status Pid Started
logger logger localhost terminating 28295 20 Aug 12:30:03
manager manager localhost running 28336 20 Aug 12:30:05
proxy-1 proxy localhost running 28375 20 Aug 12:30:06
worker-1-1 worker localhost running 28565 20 Aug 12:30:08


Thanks,

Ron


Jim Mellander
2018-08-20 21:16:13 UTC
Permalink
I hope I'm not asking the obvious, but was the warning heeded?

Warning: broctl config has changed (run the broctl "deploy" command)





Ron McClellan
2018-08-21 12:14:28 UTC
Permalink
Jim,

Never hurts to ask, but I did use deploy.

Thanks,

Ron



Azoff, Justin S
2018-08-21 01:08:11 UTC
Permalink
Post by Ron McClellan
Worked for almost 3 hours, but then started failing again. I even changed the log rotation to every 15 minutes and it still crashes. Any other suggestions? Has anyone ever tried configuring syslog-ng to handle the logging?
Warning: broctl config has changed (run the broctl "deploy" command)
Name Type Host Status Pid Started
logger logger localhost terminating 28295 20 Aug 12:30:03
manager manager localhost running 28336 20 Aug 12:30:05
proxy-1 proxy localhost running 28375 20 Aug 12:30:06
worker-1-1 worker localhost running 28565 20 Aug 12:30:08
That's really interesting, because "terminating" means something very specific, and it's not something you would see if it was crashing.

Unfortunately broctl throws away the 2nd part of the status file that would narrow that down further, but there are only a few reasons:

src/main.cc
275: set_processing_status("TERMINATING", "done_with_network");
331: set_processing_status("TERMINATING", "terminate_bro");
392: set_processing_status("TERMINATING", "termination_signal");
413: set_processing_status("TERMINATING", "sig_handler");

src/Net.cc
432: set_processing_status("TERMINATING", "net_finish");
457: set_processing_status("TERMINATING", "net_delete");

done_with_network, net_finish, and net_delete wouldn't apply to a logger node that has no network interfaces.

termination_signal and sig_handler happen when bro gets a SIGINT or SIGTERM, and terminate_bro happens
when bro exits normally.

If it does happen again and stays like that, could you run

$ sudo cat /usr/local/bro/spool/logger/.status
RUNNING [net_run]

that should show

TERMINATING [one of those reasons]

which would definitively narrow it down.

Is there anything on your system that would be killing bro? If it were the kernel OOM killer I'd expect that to show up as crashed and not terminating.
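
If it's hard to catch in the act, a small polling loop like this could record the transition automatically (a sketch; it assumes the default spool path and just appends timestamped status changes to a file):

# record every change to the logger's .status file with a timestamp
prev=""
while true; do
    cur=$(sudo cat /usr/local/bro/spool/logger/.status 2>/dev/null)
    if [ "$cur" != "$prev" ]; then
        echo "$(date '+%F %T') $cur" >> /tmp/logger-status-history.log
        prev="$cur"
    fi
    sleep 5
done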


Justin Azoff
Ron McClellan
2018-08-21 12:21:39 UTC
Permalink
Justin,

Nothing really on the system that would be killing the logger; the system is a base CentOS 7 box, recently built just for Bro. The .status file shows "TERMINATED [atexit]".

Ron

[root@ ron]# sudo cat /logs/bro/spool/logger/.status
TERMINATED [atexit]

Name Type Host Status Pid Started
logger logger localhost crashed
manager manager localhost running 55680 20 Aug 15:25:47
proxy-1 proxy localhost running 55719 20 Aug 15:25:49
worker-1-1 worker localhost running 55893 20 Aug 15:25:50
worker-1-2 worker localhost running 55897 20 Aug 15:25:50
worker-1-40 worker localhost running 56411 20 Aug 15:25:50
worker-1-41 worker localhost crashed
worker-1-42 worker localhost running 56444 20 Aug 15:25:50
worker-1-43 worker localhost running 56446 20 Aug 15:25:50

Azoff, Justin S
2018-08-21 18:35:37 UTC
Permalink
Post by Ron McClellan
Justin,
Nothing really on the system that would be killing the logger; the system is a base CentOS 7 box, recently built just for Bro. The .status file shows "TERMINATED [atexit]".
Ron
TERMINATED [atexit]
Name Type Host Status Pid Started
logger logger localhost crashed
Ah, well now that says 'crashed', which is what you'd expect if it was crashing (not 'terminating').

If it is crashing then something should say why...

Is broctl sending you a crash report when that happens? What does broctl diag say?

Are there any kernel OOM messages in dmesg or syslog?

Or any messages that look like

bro[60506]: segfault at 0 ip 00000000005fcf8d sp 00007fffaf9d2f40 error 6 in bro[400000+624000]


Justin Azoff
Ron McClellan
2018-08-21 18:53:32 UTC
Permalink
Justin,

The first 5 lines are consistent; the last 2 lines were first seen today. The crash report wasn't very useful (see below), and diag was pretty much the same. Hopefully the OOM messages help.

Ron


Aug 21 09:45:18 aosoc kernel: Out of memory: Kill process 6610 (bro) score 507 or sacrifice child
Aug 21 09:45:18 aosoc kernel: Killed process 6610 (bro) total-vm:139995144kB, anon-rss:137467264kB, file-rss:0kB, shmem-rss:0kB
Aug 21 11:32:23 aosoc kernel: bro invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 21 11:32:23 aosoc kernel: bro cpuset=/ mems_allowed=0-1
Aug 21 11:32:23 aosoc kernel: CPU: 57 PID: 21655 Comm: bro Kdump: loaded Not tainted 3.10.0-862.11.6.el7.x86_64 #1
Aug 21 11:32:23 aosoc kernel: Out of memory: Kill process 20158 (bro) score 544 or sacrifice child
Aug 21 11:32:23 aosoc kernel: Killed process 20158 (bro) total-vm:150275592kB, anon-rss:147621508kB, file-rss:0kB, shmem-rss:0kB



===============Crash Report===================

This crash report does not include a backtrace. In order for crash reports to be useful when Bro crashes, a backtrace is needed.

No core file found and gdb is not installed. It is recommended to install gdb so that BroControl can output a backtrace if Bro crashes.

Bro 2.5.4
Linux 3.10.0-862.11.6.el7.x86_64

Bro plugins: (none found)

==== No reporter.log

==== stderr.log
received termination signal

==== stdout.log
max memory size (kbytes, -m) unlimited
data seg size (kbytes, -d) unlimited
virtual memory (kbytes, -v) unlimited
core file size (blocks, -c) unlimited

==== .cmdline
-U .status -p broctl -p broctl-live -p local -p logger local.bro broctl base/frameworks/cluster local-logger.bro broctl/auto

==== .env_vars
PATH=/usr/local/bro/bin:/usr/local/bro/share/broctl/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/bro/bin:/home/ron/.local/bin:/home/ron/bin:/usr/local/bro/bin
BROPATH=/logs/bro/spool/installed-scripts-do-not-touch/site::/logs/bro/spool/installed-scripts-do-not-touch/auto:/usr/local/bro/share/bro:/usr/local/bro/share/bro/policy:/usr/local/bro/share/bro/site
CLUSTER_NODE=logger

==== .status
TERMINATED [atexit]

==== No prof.log

==== No packet_filter.log

==== No loaded_scripts.log

--
[Automatically generated.]



Azoff, Justin S
2018-08-21 20:08:17 UTC
Permalink
Post by Ron McClellan
Justin,
The first 5 lines are consistent; the last 2 lines were first seen today. The crash report wasn't very useful (see below), and diag was pretty much the same. Hopefully the OOM messages help.
Ron
Aug 21 09:45:18 aosoc kernel: Out of memory: Kill process 6610 (bro) score 507 or sacrifice child
Aug 21 09:45:18 aosoc kernel: Killed process 6610 (bro) total-vm:139995144kB, anon-rss:137467264kB, file-rss:0kB, shmem-rss:0kB
Aug 21 11:32:23 aosoc kernel: bro invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 21 11:32:23 aosoc kernel: bro cpuset=/ mems_allowed=0-1
Aug 21 11:32:23 aosoc kernel: CPU: 57 PID: 21655 Comm: bro Kdump: loaded Not tainted 3.10.0-862.11.6.el7.x86_64 #1
Aug 21 11:32:23 aosoc kernel: Out of memory: Kill process 20158 (bro) score 544 or sacrifice child
Aug 21 11:32:23 aosoc kernel: Killed process 20158 (bro) total-vm:150275592kB, anon-rss:147621508kB, file-rss:0kB, shmem-rss:0kB
Ah, this is great... well, not great in that it's crashing, but great in that now we know what's wrong: you're running out of RAM.

So you said you had 256 GB, which should normally be more than enough as long as everything is working properly, but I have a feeling some things are not working quite right.

Have you had a chance to run that python program I posted? If you have a high amount of log lag, something is not keeping up well.

Do you have any graphs of memory usage on your host?
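
If nothing is graphing memory yet, even a crude sampler left running in a screen session would show whether the logger's RSS climbs steadily until the OOM kill (a sketch; the output path is just an example):

# log the largest bro processes by resident memory once a minute
while true; do
    date '+%F %T'
    ps -C bro -o pid,rss,vsz,etime,args --sort=-rss | head -5
    sleep 60
done >> /tmp/bro-mem.log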

What exactly does this output:

$ cat /proc/cpuinfo |grep 'model name'|sort|uniq -c
40 model name : Intel(R) Xeon(R) CPU E5-2470 v2 @ 2.40GHz


The fact that you are seeing

34264380 dns_unmatched_msg
16696030 dns_unmatched_reply
62288 possible_split_routing
59512 data_before_established

in your weird.log points to something being very very wrong with your traffic. This can cause bro to work many times harder than it needs to.

How is your load balancing set up in your node.cfg?

Can you try running bro-doctor from bro-pkg: https://packages.bro.org/packages/view/74d45e8c-4fb7-11e8-88be-0a645a3f3086

If you can't run bro-pkg you just need to grab

https://raw.githubusercontent.com/ncsa/bro-doctor/master/doctor.py

and drop it in

/usr/local/bro/lib/broctl/plugins/doctor.py

and run broctl doctor.bro
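
Spelled out, that manual install is roughly (a sketch; adjust the paths if Bro isn't installed under /usr/local/bro):

curl -o /usr/local/bro/lib/broctl/plugins/doctor.py \
    https://raw.githubusercontent.com/ncsa/bro-doctor/master/doctor.py
/usr/local/bro/bin/broctl doctor.bro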




Justin Azoff
Ron McClellan
2018-08-21 20:22:13 UTC
Permalink
Thanks Justin, here is the info.

Ron


###################### Memory / CPU ######################

[root@ ron]# free
total used free shared buff/cache available
Mem: 263620592 241469748 861800 4464 21289044 21038584
Swap: 4194300 80508 4113792

[root@ current]# cat /proc/cpuinfo |grep 'model name'|sort|uniq -c
72 model name : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz


###################### Python script output ######################

lag.value 500.000000

###################### node.cfg ######################

cat /usr/local/bro/etc/node.cfg
# Example BroControl node configuration.
#
# This example has a standalone node ready to go except for possibly changing
# the sniffing interface.

# This is a complete standalone configuration. Most likely you will
# only need to change the interface.
#[bro]
#type=standalone
#host=localhost
#interface=eno1

## Below is an example clustered configuration. If you use this,
## remove the [bro] node above.

[logger]
type=logger
host=localhost
pin_cpus=3,4,5,6,7,8,9,10,11

[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost
#
[worker-1]
type=worker
host=localhost
interface=ens1f0
lb_method=pf_ring
lb_procs=48
pin_cpus=12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60


#[logger]
#type=logger
#host=localhost
#
#[manager]
#type=manager
#host=localhost
#
#[proxy-1]
#type=proxy
#host=localhost
#
#[worker-1]
#type=worker
#host=localhost
#interface=eth0
#
#[worker-2]
#type=worker
#host=localhost
#interface=eth0

##############################################################################




[root@ ron]# broctl doctor.bro
#################################################################
# Checking if many recent connections have a SAD or had history #
#################################################################
error: No conn log files in the past day???


################################################
# Checking for recent capture_loss.log entries #
################################################
Capture loss stats:
worker=worker-1-1 loss_count=107 noloss_count=0 min_loss=0.219585 max_loss=2.807555 overall_loss=0.736212145529
worker=worker-1-10 loss_count=107 noloss_count=0 min_loss=0.219585 max_loss=2.804896 overall_loss=0.73353383007
worker=worker-1-11 loss_count=102 noloss_count=0 min_loss=0.236097 max_loss=2.803468 overall_loss=0.747731387284
worker=worker-1-12 loss_count=111 noloss_count=0 min_loss=0.225453 max_loss=2.810418 overall_loss=0.713610666693
worker=worker-1-13 loss_count=106 noloss_count=0 min_loss=0.225238 max_loss=2.805285 overall_loss=0.737752297911
worker=worker-1-14 loss_count=103 noloss_count=0 min_loss=0.235811 max_loss=2.803339 overall_loss=0.75462047398
worker=worker-1-15 loss_count=113 noloss_count=0 min_loss=0.232239 max_loss=2.836904 overall_loss=0.723967320858
worker=worker-1-16 loss_count=107 noloss_count=0 min_loss=0.241669 max_loss=2.843484 overall_loss=0.756960461224
worker=worker-1-17 loss_count=103 noloss_count=0 min_loss=0.239883 max_loss=2.840825 overall_loss=0.763862024976
worker=worker-1-18 loss_count=113 noloss_count=0 min_loss=0.239311 max_loss=2.839503 overall_loss=0.737882275017
worker=worker-1-19 loss_count=107 noloss_count=0 min_loss=0.225381 max_loss=2.840043 overall_loss=0.744923084459
worker=worker-1-2 loss_count=103 noloss_count=0 min_loss=0.219585 max_loss=2.807167 overall_loss=0.749798789791
worker=worker-1-20 loss_count=103 noloss_count=0 min_loss=0.224953 max_loss=2.837683 overall_loss=0.758886002296
worker=worker-1-21 loss_count=113 noloss_count=0 min_loss=0.240954 max_loss=2.844588 overall_loss=0.735745101629
worker=worker-1-22 loss_count=105 noloss_count=0 min_loss=0.235882 max_loss=2.842643 overall_loss=0.749374186886
worker=worker-1-23 loss_count=103 noloss_count=0 min_loss=0.242026 max_loss=2.83619 overall_loss=0.767731838409
worker=worker-1-24 loss_count=110 noloss_count=0 min_loss=0.235739 max_loss=2.838526 overall_loss=0.72491227295
worker=worker-1-25 loss_count=107 noloss_count=0 min_loss=0.240525 max_loss=2.829691 overall_loss=0.749225792647
worker=worker-1-26 loss_count=103 noloss_count=0 min_loss=0.24124 max_loss=2.829044 overall_loss=0.762315842219
worker=worker-1-27 loss_count=114 noloss_count=0 min_loss=0.241884 max_loss=2.818178 overall_loss=0.729974786019
worker=worker-1-28 loss_count=107 noloss_count=0 min_loss=0.225381 max_loss=2.83352 overall_loss=0.74448587852
worker=worker-1-29 loss_count=103 noloss_count=0 min_loss=0.239525 max_loss=2.803074 overall_loss=0.757284890628
worker=worker-1-3 loss_count=114 noloss_count=0 min_loss=0.219728 max_loss=2.80548 overall_loss=0.719031297562
worker=worker-1-30 loss_count=111 noloss_count=0 min_loss=0.235454 max_loss=2.833323 overall_loss=0.728580085593
worker=worker-1-31 loss_count=104 noloss_count=0 min_loss=0.249393 max_loss=2.802944 overall_loss=0.756920450524
worker=worker-1-32 loss_count=102 noloss_count=0 min_loss=0.232239 max_loss=2.823435 overall_loss=0.746727375956
worker=worker-1-33 loss_count=114 noloss_count=0 min_loss=0.245892 max_loss=2.823399 overall_loss=0.736399137185
worker=worker-1-34 loss_count=106 noloss_count=0 min_loss=0.222024 max_loss=2.813374 overall_loss=0.734090574376
worker=worker-1-35 loss_count=103 noloss_count=0 min_loss=0.219738 max_loss=2.830536 overall_loss=0.75132105636
worker=worker-1-36 loss_count=114 noloss_count=0 min_loss=0.242244 max_loss=2.810973 overall_loss=0.732583490263
worker=worker-1-37 loss_count=107 noloss_count=0 min_loss=0.219657 max_loss=2.804831 overall_loss=0.732532430144
worker=worker-1-38 loss_count=103 noloss_count=0 min_loss=0.219585 max_loss=2.808788 overall_loss=0.747543372035
worker=worker-1-39 loss_count=114 noloss_count=0 min_loss=0.219585 max_loss=2.803075 overall_loss=0.716905805628
worker=worker-1-4 loss_count=107 noloss_count=0 min_loss=0.219728 max_loss=2.808918 overall_loss=0.736058042389
worker=worker-1-40 loss_count=107 noloss_count=0 min_loss=0.219585 max_loss=2.8021 overall_loss=0.73236088495
worker=worker-1-41 loss_count=103 noloss_count=0 min_loss=0.219585 max_loss=2.802619 overall_loss=0.746214054901
worker=worker-1-42 loss_count=113 noloss_count=0 min_loss=0.219442 max_loss=2.809179 overall_loss=0.716488984394
worker=worker-1-43 loss_count=107 noloss_count=0 min_loss=0.219442 max_loss=2.799245 overall_loss=0.732172697609
worker=worker-1-44 loss_count=103 noloss_count=0 min_loss=0.219513 max_loss=2.799373 overall_loss=0.744807113628
worker=worker-1-45 loss_count=114 noloss_count=0 min_loss=0.219441 max_loss=2.80574 overall_loss=0.716654868989
worker=worker-1-46 loss_count=107 noloss_count=0 min_loss=0.219585 max_loss=2.806646 overall_loss=0.732673849923
error: worker=worker-1-47 loss_count=42 noloss_count=0 min_loss=0.252768 max_loss=2.803858 overall_loss=1.00735434535
worker=worker-1-48 loss_count=55 noloss_count=0 min_loss=0.252767 max_loss=2.803207 overall_loss=0.874731142836
error: worker=worker-1-49 loss_count=20 noloss_count=0 min_loss=0.827054 max_loss=2.806194 overall_loss=1.3206719637
worker=worker-1-5 loss_count=103 noloss_count=0 min_loss=0.219585 max_loss=2.809826 overall_loss=0.749445632038
error: worker=worker-1-50 loss_count=20 noloss_count=0 min_loss=0.828338 max_loss=2.807944 overall_loss=1.32179609476
worker=worker-1-6 loss_count=114 noloss_count=0 min_loss=0.219585 max_loss=2.810032 overall_loss=0.720953328173
worker=worker-1-7 loss_count=107 noloss_count=0 min_loss=0.219442 max_loss=2.803858 overall_loss=0.733033867296
worker=worker-1-8 loss_count=103 noloss_count=0 min_loss=0.219585 max_loss=2.809439 overall_loss=0.749667466957
worker=worker-1-9 loss_count=114 noloss_count=0 min_loss=0.219585 max_loss=2.804442 overall_loss=0.718187501869


################################################################
# Checking what percentage of recent tcp connections show loss #
################################################################
error: No conn log files in the past day???


###################################################################
# Checking if connections are unevenly distributed across workers #
###################################################################
error: No conn log files in the past day???


###############################################################################################################################
# Checking if anything is in the deprecated local-logger.bro, local-manager.bro, local-proxy.bro, or local-worker.bro scripts #
###############################################################################################################################
Nothing found


######################################################################
# Checking if any recent connections have been logged multiple times #
######################################################################
error: No conn log files in the past day???


############################################################################
# Checking what percentage of recent tcp connections are remote to remote. #
############################################################################
error: No conn log files in the past day???


###############################################################################
# Checking if bro is linked against a custom malloc like tcmalloc or jemalloc #
###############################################################################
error: configured to use a custom malloc=False


##################################
# Checking pf_ring configuration #
##################################
error: bro binary on node worker-1-1 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-2 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-3 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-4 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-5 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-6 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-7 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-8 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-9 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-10 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-11 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-12 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-13 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-14 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-15 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-16 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-17 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-18 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-19 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-20 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-21 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-22 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-23 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-24 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-25 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-26 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-27 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-28 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-29 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-30 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-31 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-32 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-33 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-34 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-35 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-36 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-37 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-38 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-39 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-40 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-41 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-42 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-43 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-44 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-45 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-46 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-47 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-48 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: configured to use pf_ring=True pcap=False plugin=False


############################################
# Checking for recent reporter.log entries #
############################################
error: Found 48 reporter log files in the past 7 days
Recent reporter.log messages:
Reporter::INFO processing suspended
Reporter::INFO processing suspended
446 duplicate messages suppressed
Reporter::INFO processing continued
Reporter::INFO processing continued
533 duplicate messages suppressed
Reporter::INFO received termination signal
Reporter::INFO received termination signal
15 duplicate messages suppressed




-----Original Message-----
From: Azoff, Justin S <***@illinois.edu>
Sent: Tuesday, August 21, 2018 4:08 PM
To: Ron McClellan <***@ao.uscourts.gov>
Cc: ***@bro.org
Subject: Re: [Bro] BRO Logger crashing due to large DNS log files
Post by Ron McClellan
Justin,
The first 5 lines are consistent; the last 2 lines were first seen today. The crash report wasn't very useful (see below), and the diag was pretty much the same. Hopefully the OOM message helps.
Ron
Aug 21 09:45:18 aosoc kernel: Out of memory: Kill process 6610 (bro) score 507 or sacrifice child
Aug 21 09:45:18 aosoc kernel: Killed process 6610 (bro) total-vm:139995144kB, anon-rss:137467264kB, file-rss:0kB, shmem-rss:0kB
Aug 21 11:32:23 aosoc kernel: bro invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 21 11:32:23 aosoc kernel: bro cpuset=/ mems_allowed=0-1
Aug 21 11:32:23 aosoc kernel: CPU: 57 PID: 21655 Comm: bro Kdump: loaded Not tainted 3.10.0-862.11.6.el7.x86_64 #1
Aug 21 11:32:23 aosoc kernel: Out of memory: Kill process 20158 (bro) score 544 or sacrifice child
Aug 21 11:32:23 aosoc kernel: Killed process 20158 (bro) total-vm:150275592kB, anon-rss:147621508kB, file-rss:0kB, shmem-rss:0kB
Ah, this is great... well, not great in that it is crashing, but great in that now we know what is wrong: you're running out of RAM.

So you said you had 256GB, which should normally be more than enough as long as everything is working properly, but I have a feeling some things are not working quite right.

Have you had a chance to run that python program I posted? If you have a high amount of log lag, something is not keeping up well.
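If you don't have it handy any more, a rough stand-in (not that exact script, just a one-liner in the same spirit; it assumes the default /usr/local/bro/logs/current path and that the last line of dns.log is a data line rather than a #close footer) is:

$ tail -1 /usr/local/bro/logs/current/dns.log | awk -v now=$(date +%s) '{printf "lag %.0f seconds\n", now - $1}'

It just subtracts the newest entry's timestamp from the current time; anything much beyond a minute or two means the workers or the logger are falling behind.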

Do you have any graphs of memory usage on your host?

What exactly does this output:

$ cat /proc/cpuinfo |grep 'model name'|sort|uniq -c
40 model name : Intel(R) Xeon(R) CPU E5-2470 v2 @ 2.40GHz


The fact that you are seeing

34264380 dns_unmatched_msg
16696030 dns_unmatched_reply
62288 possible_split_routing
59512 data_before_established

in your weird.log points to something being very very wrong with your traffic. This can cause bro to work many times harder than it needs to.
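A quick way to re-check that breakdown after you change things is something like:

$ cat weird.log | bro-cut name | sort | uniq -c | sort -rn | head

dns_unmatched_msg / dns_unmatched_reply at those volumes usually means bro is only seeing one side of the DNS conversations, i.e. asymmetric capture or broken load balancing.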

How is your load balancing setup in your node.cfg?

Can you try running bro-doctor from bro-pkg: https://packages.bro.org/packages/view/74d45e8c-4fb7-11e8-88be-0a645a3f3086

If you can't run bro-pkg you just need to grab

https://raw.githubusercontent.com/ncsa/bro-doctor/master/doctor.py

and drop it in

/usr/local/bro/lib/broctl/plugins/doctor.py

and run broctl doctor.bro
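So roughly:

$ curl -o /usr/local/bro/lib/broctl/plugins/doctor.py https://raw.githubusercontent.com/ncsa/bro-doctor/master/doctor.py
$ /usr/local/bro/bin/broctl doctor.bro

(use wget instead of curl if that's what you have on the box)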




Justin Azoff
Azoff, Justin S
2018-08-21 20:43:27 UTC
Permalink
Post by Ron McClellan
Thanks Justin, here is the info.
Ron
ok so there's a lot of stuff wrong here, but it's all fixable.
Post by Ron McClellan
######################Memory CPU################################################################
total used free shared buff/cache available
Mem: 263620592 241469748 861800 4464 21289044 21038584
Swap: 4194300 80508 4113792
[worker-1]
type=worker
host=localhost
interface=ens1f0
lb_method=pf_ring
lb_procs=48
pin_cpus=12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
This is an absolute beast of a CPU, but you only have 36 real cores. I would run something like 30 workers, not 48.
You also have to ensure you don't pin workers to 'fake' hyperthreading cores. Depending on how the CPU gets
enumerated, you either need to use something like 6 through 35, or every other CPU; 6-35 is probably right for your CPU though.
This can be verified using hwloc-ps in the hwloc package... I want to add this to bro-doctor but it's... tricky :-)

You may also have 2 NUMA nodes; in that case you would probably want to use 15 workers on each NUMA node, which would be something like
3-18 and 21-36... very tricky. Ideally broctl could just support pin_cpus=auto as well :-)
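If you want to see the layout without wading through hwloc output, lscpu will show which CPU numbers belong to each NUMA node, something like:

$ lscpu | grep -i 'numa node'
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71

(those numbers are just an example; check what your box actually reports before picking pin_cpus).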
Post by Ron McClellan
######################Perl Script Output################################################################
lag.value 500.000000
This isn't great... but it's about what would be expected given that you're running out of RAM.
Post by Ron McClellan
#################################################################
# Checking if many recent connections have a SAD or had history #
#################################################################
error: No conn log files in the past day???
If you disabled the conn.log then most of the checks that bro-doctor does can't be run :-(
Post by Ron McClellan
###############################################################################
# Checking if bro is linked against a custom malloc like tcmalloc or jemalloc #
###############################################################################
error: configured to use a custom malloc=False
I would install gperftools-devel and rebuild bro
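On CentOS that's roughly:

$ yum install gperftools gperftools-devel

and then re-run ./configure && make && make install from the bro source tree. On Linux the build should pick tcmalloc up automatically once the devel package is installed (check the configure output to be sure), and the malloc check above should stop complaining.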
Post by Ron McClellan
##################################
# Checking pf_ring configuration #
##################################
error: bro binary on node worker-1-1 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-2 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
..
error: configured to use pf_ring=True pcap=False plugin=False
This is super, super bad! This means that every bro process is seeing all of your traffic... all 48 of them!

You can fix pf_ring, but on a recent CentOS 7 system AF_Packet works great out of the box, so I would try using that first
instead. You just need to

bro-pkg install j-gras/bro-af_packet-plugin

and then in your node.cfg use

[worker-1]
type=worker
host=localhost
interface=af_packet::ens1f0
lb_method=custom
lb_procs=30
pin_cpus=6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
af_packet_fanout_id=22
af_packet_fanout_mode=AF_Packet::FANOUT_HASH


Justin Azoff
Azoff, Justin S
2018-08-21 21:52:34 UTC
Permalink
Post by Azoff, Justin S
You may also have 2 NUMA nodes; in that case you would probably want to use 15 workers on each NUMA node, which would be something like
3-18 and 21-36... very tricky. Ideally broctl could just support pin_cpus=auto as well :-)
I double-checked this to verify I was saying the right thing, and looking closer at all our boxes that are NUMA, it seems the CPUs get allocated round-robin between them, so

0,2,4,6,8,10....34 go to numa node 0, then
1,3,5,7,9,11....35 go to numa node 1

But I think that can depend on the kernel version, and I swear this changed between CentOS 6 and 7.

This matters more if you have 2 cards you can capture from, though, since most likely one card is attached to each NUMA node.
If you only have one 10G interface you are capturing from, it doesn't matter as much.

$ hwloc-ls -p
Machine (64GB total)
NUMANode P#0 (32GB)
Package P#0 + L3 (14MB)
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
PU P#0
PU P#20
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
PU P#2
PU P#22
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
PU P#4
PU P#24
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#3
PU P#6
PU P#26
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#2
PU P#8
PU P#28
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#12
PU P#10
PU P#30
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#8
PU P#12
PU P#32
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#11
PU P#14
PU P#34
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#9
PU P#16
PU P#36
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#10
PU P#18
PU P#38
HostBridge P#2
PCIBridge
PCI 8086:1584
Net "p2p1"
NUMANode P#1 (32GB)
Package P#1 + L3 (14MB)
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
PU P#1
PU P#21
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
PU P#3
PU P#23
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
PU P#5
PU P#25
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#3
PU P#7
PU P#27
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#2
PU P#9
PU P#29
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#12
PU P#11
PU P#31
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#8
PU P#13
PU P#33
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#11
PU P#15
PU P#35
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#9
PU P#17
PU P#37
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#10
PU P#19
PU P#39
HostBridge P#6
PCIBridge
PCI 8086:1584
Net "p3p1"

This gets really confusing, but it shows that p2p1 is connected to the first NUMA node and has real CPUs 0,2,4,8... and hyperthread cores 20,22,24,26...

For boxes like that, the configuration would look something like

[...p2p1]
type=worker
host=..
interface=af_packet::p2p1
lb_method=custom
lb_procs=8
pin_cpus=4,6,8,10,12,14,16,18
af_packet_fanout_id=21
af_packet_fanout_mode=AF_Packet::FANOUT_HASH

[...p3p1]
type=worker
host=..
interface=af_packet::p3p1
lb_method=custom
lb_procs=8
pin_cpus=5,7,9,11,13,15,17,19
af_packet_fanout_id=22
af_packet_fanout_mode=AF_Packet::FANOUT_HASH




Justin Azoff
Ron McClellan
2018-08-21 22:10:58 UTC
Permalink
Justin,

I finished most of your recommendations, just need to rebuild bro, but was going to let it run overnight and see how it is running now. I really appreciate all the help.

Thanks,

Ron



]# hwloc-ls -p
Machine (256GB total)
NUMANode P#0 (128GB)
Package P#0 + L3 (25MB)
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
PU P#0
PU P#36
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
PU P#1
PU P#37
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#2
PU P#2
PU P#38
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#3
PU P#3
PU P#39
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
PU P#4
PU P#40
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#8
PU P#5
PU P#41
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#9
PU P#6
PU P#42
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#10
PU P#7
PU P#43
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#11
PU P#8
PU P#44
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#16
PU P#9
PU P#45
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#17
PU P#10
PU P#46
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#18
PU P#11
PU P#47
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#19
PU P#12
PU P#48
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#20
PU P#13
PU P#49
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#24
PU P#14
PU P#50
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#25
PU P#15
PU P#51
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#26
PU P#16
PU P#52
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#27
PU P#17
PU P#53
HostBridge P#0
PCIBridge
PCI 14e4:1657
Net "eno1"
PCI 14e4:1657
Net "eno2"
PCI 14e4:1657
Net "eno3"
PCI 14e4:1657
Net "eno4"
PCIBridge
PCI 102b:0538
GPU "card0"
GPU "controlD64"
HostBridge P#1
PCIBridge
PCI 8086:1572
Net "ens1f0"
PCI 8086:1572
Net "ens1f1"
PCIBridge
PCI 8086:1572
Net "ens3f0"
PCI 8086:1572
Net "ens3f1"
HostBridge P#2
PCIBridge
PCI 8086:1572
Net "ens2f0"
PCI 8086:1572
Net "ens2f1"
HostBridge P#3
PCIBridge
PCI 9005:028f
Block(Disk) "sda"
Block(Disk) "sdc"
NUMANode P#1 (128GB) + Package P#1 + L3 (25MB)
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
PU P#18
PU P#54
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
PU P#19
PU P#55
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#2
PU P#20
PU P#56
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#3
PU P#21
PU P#57
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
PU P#22
PU P#58
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#8
PU P#23
PU P#59
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#9
PU P#24
PU P#60
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#10
PU P#25
PU P#61
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#11
PU P#26
PU P#62
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#16
PU P#27
PU P#63
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#17
PU P#28
PU P#64
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#18
PU P#29
PU P#65
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#19
PU P#30
PU P#66
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#20
PU P#31
PU P#67
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#24
PU P#32
PU P#68
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#25
PU P#33
PU P#69
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#26
PU P#34
PU P#70
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#27
PU P#35
PU P#71

Azoff, Justin S
2018-08-22 01:46:19 UTC
Permalink
Post by Ron McClellan
I finished most of your recommendations, just need to rebuild bro, but was going to let it run overnight and see how it is running now. I really appreciate all the help.
Great! There may be more things to fix, but once that load balancing is working properly things will be in a lot better shape.
Post by Ron McClellan
]# hwloc-ls -p
Machine (256GB total)
NUMANode P#0 (128GB)
Package P#0 + L3 (25MB)
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
PU P#0 <----
PU P#36
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
PU P#1 <---
PU P#37
You have CPUs 0,1,2,3... on the same NUMA node, but every box I have puts 0,2,4... on one and 1,3,5... on the other.

Machine (64GB total)
NUMANode P#0 (32GB)
Package P#0 + L3 (14MB)
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
PU P#0 <---
PU P#20
L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
PU P#2 <---
PU P#22

All the more reason for me to get bro-doctor to do this analysis and confirm the proper pin_cpus values are being used.
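In the meantime, a quick way to double-check the mapping by hand (plain sysfs, nothing bro-specific) is:

$ cat /sys/devices/system/node/node*/cpulist

node0 prints first, then node1. Going by your hwloc output that should come back as 0-17,36-53 and 18-35,54-71 on your box, where 0-17 / 18-35 are the physical cores and 36+ are their hyperthread siblings; pin_cpus for a worker block should stay inside 0-17 or 18-35 (minus whatever you reserve for the manager and logger).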


Justin Azoff
Ron McClellan
2018-08-22 14:48:01 UTC
Permalink
Justin,

Got good news and solid progress with your help. BRO is running on both boxes and hasn't crashed since 10pm last night. If I read the data about NUMA from my systems, I don't really need to split the load between 2 workers as you did, right? I'm working on tuning some now and also trying to address the really high lag (500) that I'm still seeing. Currently seeing some loss on it, but will continue to tune and see if I can get that under control. Let me know if you need help testing the doctor script.

Ron


# cat capture_loss.log
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path capture_loss
#open 2018-08-22-10-01-21
#fields ts ts_delta peer gaps acks percent_lost
#types time interval string count count double
1534946481.938006 900.000084 worker-1-20 33 696 4.741379
1534946481.941548 900.000000 worker-1-24 20 2722 0.734754
1534946481.938533 900.000059 worker-1-21 630 40222 1.566307
1534946481.938396 900.000070 worker-1-9 89 1470 6.054422
1534946481.941452 900.000044 worker-1-8 156 1821 8.566722
1534946481.941323 900.000017 worker-1-12 1062 232679 0.456423
1534946481.939547 900.000037 worker-1-27 1023 216063 0.473473
1534946481.937269 900.000040 worker-1-10 749 5465 13.705398
1534946481.937517 900.000111 worker-1-3 87 15720 0.553435
1534946481.941367 900.000649 worker-1-16 117 2187 5.349794
1534946481.939451 900.000079 worker-1-7 870 195358 0.445336
1534946481.940450 900.000041 worker-1-5 111 626 17.731629
1534946481.931345 900.000019 worker-1-4 44 885 4.971751
1534946481.941268 900.000074 worker-1-17 131 1641 7.982937
1534946481.946945 900.000039 worker-1-18 189 1350 14.0
1534946481.941532 900.000083 worker-1-25 118 9414 1.253452
1534946481.942680 900.000094 worker-1-30 1375 2635 52.182163
1534946481.937385 900.000074 worker-1-1 1050 232183 0.452229
1534946481.939621 900.000062 worker-1-26 20 1973 1.013685
1534946481.942331 900.000127 worker-1-2 1236 240350 0.51425
1534946481.938535 900.000003 worker-1-29 133 2923 4.55012
1534946481.938737 900.000077 worker-1-13 1463 223976 0.653195
1534946481.937868 900.000121 worker-1-15 278 2360 11.779661
1534946481.937738 900.000006 worker-1-28 36 765 4.705882
1534946481.940076 900.000039 worker-1-23 43 3749 1.146973
1534946481.940530 900.000008 worker-1-22 1151 4798 23.989162
1534946481.944632 900.000030 worker-1-19 510 88481 0.576395
1534946481.937329 900.000045 worker-1-6 891 146039 0.610111
1534946481.938533 900.000095 worker-1-14 206 2276 9.050967
1534946481.937384 900.000074 worker-1-11 222 2176 10.202206
1534947381.938548 900.000013 worker-1-29 1135 241449 0.470079
1534947381.942682 900.000002 worker-1-30 399 13150 3.034221
1534947381.939458 900.000007 worker-1-7 332 66504 0.499218
1534947381.937742 900.000004 worker-1-28 31 711 4.360056
1534947381.940622 900.000092 worker-1-22 77 1728 4.456019
1534947381.938073 900.000067 worker-1-20 103 2343 4.396073
1534947381.941622 900.000074 worker-1-24 90 7394 1.217203
1534947381.941549 900.000017 worker-1-25 1259 235553 0.534487
1534947381.941454 900.000087 worker-1-16 231 5455 4.234647
1534947381.942399 900.000068 worker-1-2 69 1293 5.336427
1534947381.941324 900.000056 worker-1-17 152 759 20.02635
1534947381.931395 900.000050 worker-1-4 1310 240018 0.545792
1534947381.938810 900.000073 worker-1-13 109 17301 0.630021
1534947381.938606 900.000073 worker-1-14 305 2184 13.965201
1534947381.937398 900.000069 worker-1-6 67 3465 1.933622
1534947381.940457 900.000007 worker-1-5 118 1280 9.21875
1534947381.937470 900.000085 worker-1-1 24 1581 1.518027
1534947381.940195 900.000119 worker-1-23 189 20872 0.905519
1534947381.937614 900.000097 worker-1-3 1167 213001 0.547885
1534947381.944751 900.000119 worker-1-19 160 4249 3.765592
1534947381.937943 900.000075 worker-1-15 593 2541 23.337269
1534947381.947066 900.000121 worker-1-18 809 160344 0.50454
1534947381.939548 900.000001 worker-1-27 219 2612 8.38438
1534947381.938628 900.000095 worker-1-21 302 1627 18.56177
1534947381.937326 900.000057 worker-1-10 107 1763 6.0692
1534947381.938497 900.000101 worker-1-9 1599 238664 0.66998
1534947381.941398 900.000075 worker-1-12 201 2936 6.846049
1534947381.937399 900.000015 worker-1-11 1382 236433 0.584521
1534947381.939677 900.000056 worker-1-26 52 1100 4.727273
1534947381.941453 900.000001 worker-1-8 224 1601 13.991255
1534948281.939548 900.000090 worker-1-7 1088 235524 0.461949
1534948281.941678 900.000129 worker-1-25 202 32683 0.618058
1534948281.947198 900.000132 worker-1-18 284 6208 4.574742
1534948281.937477 900.000079 worker-1-6 70 14679 0.476872
1534948281.937532 900.000062 worker-1-1 57 1621 3.516348
1534948281.937477 900.000078 worker-1-11 71 24940 0.284683
1534948281.938938 900.000128 worker-1-13 111 12288 0.90332
1534948281.941679 900.000057 worker-1-24 731 121315 0.602564
1534948281.938621 900.000015 worker-1-14 1056 230109 0.458913
1534948281.942751 900.000069 worker-1-30 34 448 7.589286
1534948281.938548 900.000000 worker-1-29 219 1033 21.200387
1534948281.941325 900.000001 worker-1-17 671 111097 0.603977
1534948281.937348 900.000022 worker-1-10 145 1917 7.563902
1534948281.938055 900.000112 worker-1-15 859 187429 0.458307
1534948281.939622 900.000074 worker-1-27 50 3453 1.448016
1534948281.931396 900.000001 worker-1-4 193 3759 5.134344
1534948281.937780 900.000038 worker-1-28 230 6086 3.779165
1534948281.938109 900.000036 worker-1-20 1135 230316 0.492801
1534948281.938512 900.000015 worker-1-9 44 3888 1.131687
1534948281.940323 900.000128 worker-1-23 30 1212 2.475248
1534948281.939677 900.000000 worker-1-26 165 6336 2.604167
1534948281.940527 900.000070 worker-1-5 96 5162 1.859744
1534948281.937736 900.000122 worker-1-3 1123 249305 0.450452
1534948281.941454 900.000001 worker-1-8 67 1910 3.507853
1534948281.940679 900.000057 worker-1-22 115 4310 2.668213
1534948281.938677 900.000049 worker-1-21 25 2141 1.167679
1534948281.944879 900.000128 worker-1-19 29 1637 1.771533
1534948281.942454 900.000055 worker-1-2 36 2033 1.770782
1534948281.941453 900.000055 worker-1-12 26 991 2.623613
1534948281.941454 900.000000 worker-1-16 1127 230791 0.488321

cat capture_loss.log
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path capture_loss
#open 2018-08-22-10-06-13
#fields ts ts_delta peer gaps acks percent_lost
#types time interval string count count double
1534946772.685666 900.000108 worker-1-9 71276 209039 34.096987
1534946772.682117 900.000110 worker-1-20 43286 430827 10.047188
1534946772.686758 900.000020 worker-1-22 58337 172653 33.788582
1534946772.689750 900.000013 worker-1-17 61579 422200 14.585268
1534946772.683422 900.000599 worker-1-4 62846 224500 27.993764
1534946772.692533 900.000076 worker-1-13 56519 190555 29.660203
1534946772.684749 900.000086 worker-1-15 41612 129870 32.041272
1534946772.684889 900.000230 worker-1-27 76559 187163 40.904987
1534946772.683731 900.000001 worker-1-25 74450 188407 39.515517
1534946772.681934 900.000111 worker-1-5 50253 153355 32.769065
1534946772.682021 900.000012 worker-1-28 52191 151854 34.369197
1534946772.682825 900.000074 worker-1-8 52037 190660 27.293087
1534946772.699409 900.000084 worker-1-16 88137 266670 33.050962
1534946772.685734 900.000100 worker-1-30 51271 238600 21.488265
1534946772.682739 900.000022 worker-1-6 66273 250566 26.449319
1534946772.682741 900.000063 worker-1-26 49902 153687 32.46989
1534946772.681960 900.000006 worker-1-1 89188 255018 34.973218
1534946772.682631 900.000622 worker-1-29 60705 210476 28.841768
1534946772.681953 900.000075 worker-1-2 38281 125211 30.573192
1534946772.682673 900.000007 worker-1-3 67450 187531 35.967387
1534946772.686732 900.000060 worker-1-23 55932 191885 29.148709
1534946772.681828 900.000005 worker-1-7 66947 445007 15.044033
1534946772.681886 900.000007 worker-1-11 48944 138084 35.445091
1534946772.693528 900.000000 worker-1-14 65762 188557 34.876456
1534946772.681885 900.000006 worker-1-10 62149 428124 14.516589
1534946772.685697 900.000017 worker-1-21 48039 147640 32.53793
1534946772.683753 900.000022 worker-1-19 59660 157172 37.958415
1534946772.705397 900.000127 worker-1-24 71820 223813 32.089289
1534946772.688718 900.000117 worker-1-18 48410 452562 10.696877
1534946772.685511 900.000137 worker-1-12 46673 145455 32.087587
1534947672.682048 900.000114 worker-1-5 68107 180382 37.757093
1534947672.683025 900.000286 worker-1-6 45761 183027 25.002322
1534947672.685750 900.000053 worker-1-21 50836 422213 12.040368
1534947672.683879 900.000126 worker-1-19 53010 178899 29.631244
1534947672.693643 900.000115 worker-1-14 92038 425392 21.636044
1534947672.682825 900.000084 worker-1-26 55076 176437 31.215675
1534947672.682008 900.000123 worker-1-10 73148 207138 35.313656
1534947672.699475 900.000066 worker-1-16 72461 223957 32.354872
1534947672.684952 900.000063 worker-1-27 47858 167864 28.509984
1534947672.686884 900.000126 worker-1-22 65305 192727 33.884718
1534947672.681973 900.000020 worker-1-2 60511 181325 33.37157
1534947672.682136 900.000176 worker-1-1 109592 280275 39.101597
1534947672.682749 900.000118 worker-1-29 64164 192112 33.399267
1534947672.689756 900.000006 worker-1-17 61667 166246 37.093825
1534947672.683803 900.000072 worker-1-25 56366 464877 12.124928
1534947672.682152 900.000035 worker-1-20 49701 148229 33.529876
1534947672.685826 900.000092 worker-1-30 54071 160228 33.746287
1534947672.684823 900.000074 worker-1-15 60758 204305 29.738871
1534947672.685527 900.000016 worker-1-12 51410 166297 30.914569
1534947672.688722 900.000004 worker-1-18 73693 218226 33.76912
1534947672.682082 900.000061 worker-1-28 62184 198747 31.288019
1534947672.686826 900.000094 worker-1-23 57861 221752 26.092662
1534947672.682903 900.000078 worker-1-8 48482 219779 22.059432
1534947672.685711 900.000045 worker-1-9 53372 172244 30.986275
1534947672.692602 900.000069 worker-1-13 62358 502957 12.398277
1534947672.682167 900.000281 worker-1-11 48767 198101 24.617241
1534947672.705447 900.000050 worker-1-24 55112 186729 29.51443
1534947672.682731 900.000058 worker-1-3 56891 162845 34.935675
1534947672.683487 900.000065 worker-1-4 78602 255868 30.719746
1534947672.681880 900.000052 worker-1-7 51099 541967 9.428434
1534948572.682094 900.000086 worker-1-10 82032 524780 15.631693
1534948572.693667 900.000024 worker-1-14 85369 297217 28.722785
1534948572.682472 900.000499 worker-1-2 53654 221056 24.271678
1534948572.686886 900.000002 worker-1-22 55666 467706 11.901921
1534948572.685008 900.000056 worker-1-27 86916 263647 32.966808
1534948572.682279 900.000127 worker-1-20 89828 256003 35.088651
1534948572.682223 900.000087 worker-1-1 62337 344970 18.070267
1534948572.685750 900.000000 worker-1-21 70389 510644 13.784359
1534948572.684880 900.000057 worker-1-15 67459 206447 32.676183
1534948572.685740 900.000029 worker-1-9 57163 227031 25.1785
1534948572.682752 900.000021 worker-1-3 61958 204039 30.365763
1534948572.682835 900.000010 worker-1-26 54506 196350 27.759613
1534948572.683153 900.000128 worker-1-6 60501 190365 31.781577
1534948572.682183 900.000016 worker-1-11 63835 191625 33.312459
1534948572.682208 900.000126 worker-1-28 91876 284589 32.28375
1534948572.683828 900.000025 worker-1-25 44239 139128 31.797338
1534948572.685880 900.000054 worker-1-30 55616 172434 32.2535
1534948572.689884 900.000128 worker-1-17 69725 178142 39.140124
1534948572.681961 900.000081 worker-1-7 53776 220472 24.391306
1534948572.683937 900.000058 worker-1-19 50184 163270 30.736816
1534948572.685538 900.000011 worker-1-12 60185 260306 23.120865
1534948572.686889 900.000063 worker-1-23 59788 194439 30.748975
1534948572.682908 900.000005 worker-1-8 60904 532647 11.434214
1534948572.692674 900.000072 worker-1-13 67152 216975 30.949188
1534948572.688750 900.000028 worker-1-18 70383 235710 29.859997
1534948572.705484 900.000037 worker-1-24 57008 201189 28.335545
1534948572.682147 900.000099 worker-1-5 61878 194825 31.760811
1534948572.699536 900.000061 worker-1-16 76385 256671 29.759887
1534948572.682829 900.000080 worker-1-29 52464 188150 27.884135
1534948572.683536 900.000049 worker-1-4 110222 314119 35.08925

[***@aosoc current]# broctl netstats
worker-1-1: 1534949053.166850 recvd=813997 dropped=0 link=813997
worker-1-2: 1534949053.366803 recvd=873351 dropped=0 link=873353
worker-1-3: 1534949053.567778 recvd=1770808 dropped=0 link=1770810
worker-1-4: 1534949053.767852 recvd=865443 dropped=0 link=865449
worker-1-5: 1534949053.968873 recvd=349355 dropped=0 link=349361
worker-1-6: 1534949054.168785 recvd=1152160 dropped=0 link=1152161
worker-1-7: 1534949054.368825 recvd=1358553 dropped=0 link=1358553
worker-1-8: 1534949054.569808 recvd=345267 dropped=0 link=345272
worker-1-9: 1534949054.769982 recvd=856725 dropped=0 link=856732
worker-1-10: 1534949054.969811 recvd=351148 dropped=0 link=351148
worker-1-11: 1534949055.170855 recvd=883897 dropped=0 link=883897
worker-1-12: 1534949055.370950 recvd=820117 dropped=0 link=820125
worker-1-13: 1534949055.571899 recvd=1132465 dropped=0 link=1132473
worker-1-14: 1534949055.771751 recvd=823249 dropped=0 link=823249
worker-1-15: 1534949055.972921 recvd=754342 dropped=0 link=754343
worker-1-16: 1534949056.173778 recvd=822102 dropped=0 link=822106
worker-1-17: 1534949056.373806 recvd=570905 dropped=0 link=570911
worker-1-18: 1534949056.573815 recvd=1033845 dropped=0 link=1033846
worker-1-19: 1534949056.774737 recvd=648977 dropped=0 link=649001
worker-1-20: 1534949056.974823 recvd=816836 dropped=0 link=816838
worker-1-21: 1534949057.175858 recvd=423896 dropped=0 link=423901
worker-1-22: 1534949057.375894 recvd=761794 dropped=0 link=761796
worker-1-23: 1534949057.576737 recvd=415151 dropped=0 link=415153
worker-1-24: 1534949057.776887 recvd=604342 dropped=0 link=604349
worker-1-25: 1534949057.978046 recvd=911772 dropped=0 link=911785
worker-1-26: 1534949058.177749 recvd=358386 dropped=0 link=358395
worker-1-27: 1534949058.379062 recvd=1283463 dropped=0 link=1283465
worker-1-28: 1534949058.578751 recvd=364801 dropped=0 link=364807
worker-1-29: 1534949058.778735 recvd=930041 dropped=0 link=930042
worker-1-30: 1534949058.979938 recvd=857963 dropped=0 link=857967



Azoff, Justin S
2018-08-22 14:58:35 UTC
Permalink
Post by Ron McClellan
Justin,
Got good news and solid progress with your help. BRO is running on both boxes and hasn't crashed since 10pm last night. If I read the data about NUMA from my systems, I don't really need to split the load between 2 workers as you did, right?
If you can get another NIC so each box has 2, then you could divide the workers between each NIC and NUMA node. Otherwise it doesn't matter so much.
Post by Ron McClellan
I'm working on tuning some now and also trying to address the really high lag (500) that I'm still seeing. Currently seeing some loss on it, but will continue to tune and see if I can get that under control. Let me know if you need help testing the doctor script.
Ron
1534948572.682908 900.000005 worker-1-8 60904 532647 11.434214
1534948572.692674 900.000072 worker-1-13 67152 216975 30.949188
1534948572.688750 900.000028 worker-1-18 70383 235710 29.859997
1534948572.705484 900.000037 worker-1-24 57008 201189 28.335545
1534948572.682147 900.000099 worker-1-5 61878 194825 31.760811
1534948572.699536 900.000061 worker-1-16 76385 256671 29.759887
1534948572.682829 900.000080 worker-1-29 52464 188150 27.884135
1534948572.683536 900.000049 worker-1-4 110222 314119 35.08925
worker-1-1: 1534949053.166850 recvd=813997 dropped=0 link=813997
worker-1-2: 1534949053.366803 recvd=873351 dropped=0 link=873353
worker-1-3: 1534949053.567778 recvd=1770808 dropped=0 link=1770810
worker-1-4: 1534949053.767852 recvd=865443 dropped=0 link=865449
worker-1-5: 1534949053.968873 recvd=349355 dropped=0 link=349361
worker-1-6: 1534949054.168785 recvd=1152160 dropped=0 link=1152161
worker-1-7: 1534949054.368825 recvd=1358553 dropped=0 link=1358553
worker-1-8: 1534949054.569808 recvd=345267 dropped=0 link=345272
worker-1-9: 1534949054.769982 recvd=856725 dropped=0 link=856732
worker-1-10: 1534949054.969811 recvd=351148 dropped=0 link=351148
worker-1-11: 1534949055.170855 recvd=883897 dropped=0 link=883897
worker-1-12: 1534949055.370950 recvd=820117 dropped=0 link=820125
worker-1-13: 1534949055.571899 recvd=1132465 dropped=0 link=1132473
worker-1-14: 1534949055.771751 recvd=823249 dropped=0 link=823249
worker-1-15: 1534949055.972921 recvd=754342 dropped=0 link=754343
worker-1-16: 1534949056.173778 recvd=822102 dropped=0 link=822106
worker-1-17: 1534949056.373806 recvd=570905 dropped=0 link=570911
worker-1-18: 1534949056.573815 recvd=1033845 dropped=0 link=1033846
worker-1-19: 1534949056.774737 recvd=648977 dropped=0 link=649001
worker-1-20: 1534949056.974823 recvd=816836 dropped=0 link=816838
worker-1-21: 1534949057.175858 recvd=423896 dropped=0 link=423901
worker-1-22: 1534949057.375894 recvd=761794 dropped=0 link=761796
worker-1-23: 1534949057.576737 recvd=415151 dropped=0 link=415153
worker-1-24: 1534949057.776887 recvd=604342 dropped=0 link=604349
worker-1-25: 1534949057.978046 recvd=911772 dropped=0 link=911785
worker-1-26: 1534949058.177749 recvd=358386 dropped=0 link=358395
worker-1-27: 1534949058.379062 recvd=1283463 dropped=0 link=1283465
worker-1-28: 1534949058.578751 recvd=364801 dropped=0 link=364807
worker-1-29: 1534949058.778735 recvd=930041 dropped=0 link=930042
worker-1-30: 1534949058.979938 recvd=857963 dropped=0 link=857967
If you're seeing a high percentage of capture loss but netstats is showing 0 dropped packets, that means one of two things:

* Something still isn't right with the load balancing. It could be that your NIC isn't doing symmetric hashing properly.
* There's an issue with the traffic upstream of bro.

A bunch of the checks that bro-doctor does can help diagnose this, but you'd need to re-enable the conn.log
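One thing you can eyeball by hand once conn.log is back: tally the history field, e.g.

$ cat conn.log | bro-cut history | sort | uniq -c | sort -rn | head

If the top entries are mostly one-sided (all uppercase letters or all lowercase), bro is only seeing one direction of those connections, which points back at the hashing / upstream questions above. That's essentially what the bro-doctor "SAD or had history" check automates.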


Justin Azoff
Ron McClellan
2018-08-22 15:40:54 UTC
Permalink
Sorry, I forgot to send that; I did re-enable the conn.log.

Ron


#################################################################
# Checking if many recent connections have a SAD or had history #
#################################################################
error: 24.30%, 52 out of 214 connections are half duplex


################################################
# Checking for recent capture_loss.log entries #
################################################
Capture loss stats:
worker=worker-1-1 loss_count=169 noloss_count=5 min_loss=0.0 max_loss=49.699158 overall_loss=0.711248084022
worker=worker-1-10 loss_count=165 noloss_count=9 min_loss=0.0 max_loss=31.734317 overall_loss=0.708971591962
worker=worker-1-11 loss_count=163 noloss_count=7 min_loss=0.0 max_loss=49.215598 overall_loss=0.733236081253
worker=worker-1-12 loss_count=172 noloss_count=7 min_loss=0.0 max_loss=19.54023 overall_loss=0.698003997145
worker=worker-1-13 loss_count=170 noloss_count=3 min_loss=0.0 max_loss=59.151943 overall_loss=0.708047294873
worker=worker-1-14 loss_count=166 noloss_count=4 min_loss=0.0 max_loss=31.102362 overall_loss=0.729493755422
worker=worker-1-15 loss_count=175 noloss_count=5 min_loss=0.0 max_loss=23.337269 overall_loss=0.696221193462
worker=worker-1-16 loss_count=171 noloss_count=3 min_loss=0.0 max_loss=21.665887 overall_loss=0.730258043624
worker=worker-1-17 loss_count=166 noloss_count=2 min_loss=0.0 max_loss=24.537037 overall_loss=0.730114941561
worker=worker-1-18 loss_count=173 noloss_count=5 min_loss=0.0 max_loss=35.149385 overall_loss=0.707049677317
worker=worker-1-19 loss_count=170 noloss_count=2 min_loss=0.0 max_loss=29.000411 overall_loss=0.715538683512
worker=worker-1-2 loss_count=167 noloss_count=3 min_loss=0.0 max_loss=25.104778 overall_loss=0.725878156063
worker=worker-1-20 loss_count=157 noloss_count=11 min_loss=0.0 max_loss=27.755102 overall_loss=0.732279462789
worker=worker-1-21 loss_count=175 noloss_count=3 min_loss=0.0 max_loss=21.428571 overall_loss=0.709650475913
worker=worker-1-22 loss_count=166 noloss_count=4 min_loss=0.0 max_loss=47.108131 overall_loss=0.723601858469
worker=worker-1-23 loss_count=163 noloss_count=5 min_loss=0.0 max_loss=26.769231 overall_loss=0.737703465832
worker=worker-1-24 loss_count=171 noloss_count=5 min_loss=0.0 max_loss=29.166667 overall_loss=0.70277340392
worker=worker-1-25 loss_count=167 noloss_count=5 min_loss=0.0 max_loss=43.062904 overall_loss=0.719155267119
worker=worker-1-26 loss_count=166 noloss_count=2 min_loss=0.0 max_loss=37.043478 overall_loss=0.733480109044
worker=worker-1-27 loss_count=172 noloss_count=7 min_loss=0.0 max_loss=21.570681 overall_loss=0.702686952111
worker=worker-1-28 loss_count=168 noloss_count=4 min_loss=0.0 max_loss=63.5 overall_loss=0.715802336358
worker=worker-1-29 loss_count=163 noloss_count=5 min_loss=0.0 max_loss=21.200387 overall_loss=0.726079337034
worker=worker-1-3 loss_count=178 noloss_count=3 min_loss=0.0 max_loss=21.5 overall_loss=0.695671670577
worker=worker-1-30 loss_count=170 noloss_count=7 min_loss=0.0 max_loss=52.182163 overall_loss=0.711452182534
worker=worker-1-31 loss_count=101 noloss_count=0 min_loss=0.249393 max_loss=2.802944 overall_loss=0.734132084543
worker=worker-1-32 loss_count=100 noloss_count=0 min_loss=0.232239 max_loss=2.823435 overall_loss=0.732152381242
worker=worker-1-33 loss_count=111 noloss_count=0 min_loss=0.245892 max_loss=2.823399 overall_loss=0.715639440645
worker=worker-1-34 loss_count=103 noloss_count=0 min_loss=0.222024 max_loss=2.813374 overall_loss=0.711321116497
worker=worker-1-35 loss_count=100 noloss_count=0 min_loss=0.219738 max_loss=2.830536 overall_loss=0.728313426089
worker=worker-1-36 loss_count=111 noloss_count=0 min_loss=0.242244 max_loss=2.810973 overall_loss=0.711242220111
worker=worker-1-37 loss_count=104 noloss_count=0 min_loss=0.219657 max_loss=2.804831 overall_loss=0.710176458823
worker=worker-1-38 loss_count=100 noloss_count=0 min_loss=0.219585 max_loss=2.808788 overall_loss=0.724778339046
worker=worker-1-39 loss_count=111 noloss_count=0 min_loss=0.219585 max_loss=2.803075 overall_loss=0.695610695301
worker=worker-1-4 loss_count=170 noloss_count=4 min_loss=0.0 max_loss=44.75899 overall_loss=0.710287169691
worker=worker-1-40 loss_count=104 noloss_count=0 min_loss=0.219585 max_loss=2.8021 overall_loss=0.70998112771
worker=worker-1-41 loss_count=100 noloss_count=0 min_loss=0.219585 max_loss=2.802619 overall_loss=0.72342640507
worker=worker-1-42 loss_count=110 noloss_count=0 min_loss=0.219442 max_loss=2.809179 overall_loss=0.694948046234
worker=worker-1-43 loss_count=104 noloss_count=0 min_loss=0.219442 max_loss=2.799245 overall_loss=0.709799229819
worker=worker-1-44 loss_count=100 noloss_count=0 min_loss=0.219513 max_loss=2.799373 overall_loss=0.721964444083
worker=worker-1-45 loss_count=111 noloss_count=0 min_loss=0.219441 max_loss=2.80574 overall_loss=0.695340745288
worker=worker-1-46 loss_count=104 noloss_count=0 min_loss=0.219585 max_loss=2.806646 overall_loss=0.710300132624
error: worker=worker-1-47 loss_count=42 noloss_count=0 min_loss=0.252768 max_loss=2.803858 overall_loss=1.00735434535
worker=worker-1-48 loss_count=55 noloss_count=0 min_loss=0.252767 max_loss=2.803207 overall_loss=0.874731142836
error: worker=worker-1-49 loss_count=20 noloss_count=0 min_loss=0.827054 max_loss=2.806194 overall_loss=1.3206719637
worker=worker-1-5 loss_count=164 noloss_count=6 min_loss=0.0 max_loss=28.184893 overall_loss=0.721071748929
error: worker=worker-1-50 loss_count=20 noloss_count=0 min_loss=0.828338 max_loss=2.807944 overall_loss=1.32179609476
worker=worker-1-6 loss_count=178 noloss_count=3 min_loss=0.0 max_loss=31.229236 overall_loss=0.69677119794
worker=worker-1-7 loss_count=170 noloss_count=4 min_loss=0.0 max_loss=23.232323 overall_loss=0.706048088686
worker=worker-1-8 loss_count=169 noloss_count=1 min_loss=0.0 max_loss=28.685259 overall_loss=0.720893457905
worker=worker-1-9 loss_count=174 noloss_count=7 min_loss=0.0 max_loss=21.649485 overall_loss=0.693994393064


################################################################
# Checking what percentage of recent tcp connections show loss #
################################################################
error: 13.08%, 28 out of 214 connections have capture loss


###################################################################
# Checking if connections are unevenly distributed across workers #
###################################################################
error: No node names in conn log. Install add-node-names package to add a corresponding field.


###############################################################################################################################
# Checking if anything is in the deprecated local-logger.bro, local-manager.bro, local-proxy.bro, or local-worker.bro scripts #
###############################################################################################################################
Nothing found


######################################################################
# Checking if any recent connections have been logged multiple times #
######################################################################
ok, only 0.00%, 0 out of 19 connections appear to be duplicate


############################################################################
# Checking what percentage of recent tcp connections are remote to remote. #
############################################################################
0.00%, 0 out of 100000 connections are remote to remote


###############################################################################
# Checking if bro is linked against a custom malloc like tcmalloc or jemalloc #
###############################################################################
error: configured to use a custom malloc=False


##################################
# Checking pf_ring configuration #
##################################
configured to use pf_ring=False pcap=False plugin=False


############################################
# Checking for recent reporter.log entries #
############################################
error: Found 66 reporter log files in the past 7 days
Recent reporter.log messages:
Reporter::INFO processing suspended
Reporter::INFO processing suspended
305 duplicate messages suppressed
Reporter::INFO processing continued
Reporter::INFO processing continued
344 duplicate messages suppressed
1534945569.655686 Reporter::INFO received termination signal
1534945569.657754 Reporter::INFO received termination signal
174 duplicate messages suppressed
1534945569.655686 Reporter::INFO 59878 packets received on interface ens1f0, 0 dropped
1534945569.657754 Reporter::INFO 56518 packets received on interface ens1f0, 0 dropped
1534945569.660757 Reporter::INFO 56724 packets received on interface ens1f0, 0 dropped
1534945569.658804 Reporter::INFO 54919 packets received on interface ens1f0, 0 dropped
1534945569.663107 Reporter::INFO 55802 packets received on interface ens1f0, 0 dropped
1534945569.665709 Reporter::INFO 55495 packets received on interface ens1f0, 0 dropped
1534945569.664417 Reporter::INFO 56303 packets received on interface ens1f0, 0 dropped
1534945569.661807 Reporter::INFO 54850 packets received on interface ens1f0, 0 dropped
1534945569.652757 Reporter::INFO 56683 packets received on interface ens1f0, 0 dropped
1534945569.666680 Reporter::INFO 728568 packets received on interface ens1f0, 0 dropped
1534945569.668403 Reporter::INFO 54410 packets received on interface ens1f0, 0 dropped
1534945569.659922 Reporter::INFO 66057 packets received on interface ens1f0, 0 dropped
1534945569.670068 Reporter::INFO 96452 packets received on interface ens1f0, 0 dropped
1534945569.654122 Reporter::INFO 60502 packets received on interface ens1f0, 0 dropped
1534945569.674937 Reporter::INFO 534148 packets received on interface ens1f0, 0 dropped
1534945569.676096 Reporter::INFO 59423 packets received on interface ens1f0, 0 dropped
1534945569.671680 Reporter::INFO 54199 packets received on interface ens1f0, 0 dropped
1534945569.675688 Reporter::INFO 55103 packets received on interface ens1f0, 0 dropped
1534945569.672645 Reporter::INFO 55276 packets received on interface ens1f0, 0 dropped
1534945569.681012 Reporter::INFO 59191 packets received on interface ens1f0, 0 dropped
1534945569.685158 Reporter::INFO 526830 packets received on interface ens1f0, 0 dropped
1534945569.682198 Reporter::INFO 56117 packets received on interface ens1f0, 0 dropped
1534945569.686343 Reporter::INFO 280179 packets received on interface ens1f0, 0 dropped
1534945569.685319 Reporter::INFO 53622 packets received on interface ens1f0, 0 dropped
1534945569.690807 Reporter::INFO 55072 packets received on interface ens1f0, 0 dropped
1534945569.679809 Reporter::INFO 425150 packets received on interface ens1f0, 0 dropped
1534945569.680279 Reporter::INFO 57843 packets received on interface ens1f0, 0 dropped
1534945569.692148 Reporter::INFO 54710 packets received on interface ens1f0, 0 dropped
1534945569.677281 Reporter::INFO 55582 packets received on interface ens1f0, 0 dropped
1534945569.689345 Reporter::INFO 67500 packets received on interface ens1f0, 0 dropped
1534944921.064766 Reporter::INFO 7476939 packets received on interface ens1f0, 0 dropped
1534944921.066590 Reporter::INFO 5979272 packets received on interface ens1f0, 0 dropped
1534944921.069660 Reporter::INFO 11350634 packets received on interface ens1f0, 0 dropped
1534944921.068621 Reporter::INFO 7846634 packets received on interface ens1f0, 0 dropped
1534944921.070665 Reporter::INFO 7716706 packets received on interface ens1f0, 0 dropped
1534944921.072682 Reporter::INFO 8075699 packets received on interface ens1f0, 0 dropped
1534944921.073648 Reporter::INFO 7254555 packets received on interface ens1f0, 0 dropped
1534944921.071549 Reporter::INFO 8783077 packets received on interface ens1f0, 0 dropped
1534944921.076533 Reporter::INFO 7179347 packets received on interface ens1f0, 0 dropped
1534944921.075584 Reporter::INFO 7727934 packets received on interface ens1f0, 0 dropped
1534944921.074546 Reporter::INFO 6784747 packets received on interface ens1f0, 0 dropped
1534944921.078555 Reporter::INFO 6505658 packets received on interface ens1f0, 0 dropped
1534944921.079501 Reporter::INFO 11272110 packets received on interface ens1f0, 0 dropped
1534944921.077461 Reporter::INFO 11702900 packets received on interface ens1f0, 0 dropped
1534944921.083506 Reporter::INFO 9770050 packets received on interface ens1f0, 0 dropped
1534944921.084694 Reporter::INFO 5377432 packets received on interface ens1f0, 0 dropped
1534944921.080490 Reporter::INFO 7853539 packets received on interface ens1f0, 0 dropped
1534944921.089195 Reporter::INFO 7411832 packets received on interface ens1f0, 0 dropped
1534944921.088097 Reporter::INFO 20093065 packets received on interface ens1f0, 0 dropped
1534944921.081517 Reporter::INFO 10139208 packets received on interface ens1f0, 0 dropped
1534944921.082494 Reporter::INFO 9928228 packets received on interface ens1f0, 0 dropped
1534944921.085820 Reporter::INFO 9309141 packets received on interface ens1f0, 0 dropped
1534944921.092678 Reporter::INFO 7987870 packets received on interface ens1f0, 0 dropped
1534944921.093619 Reporter::INFO 9510548 packets received on interface ens1f0, 0 dropped
1534944921.095941 Reporter::INFO 7619163 packets received on interface ens1f0, 0 dropped
1534944921.071034 Reporter::INFO 7962640 packets received on interface ens1f0, 2888 dropped
1534944921.099008 Reporter::INFO 7523835 packets received on interface ens1f0, 0 dropped
1534944921.101288 Reporter::INFO 8860676 packets received on interface ens1f0, 0 dropped
1534944921.102270 Reporter::INFO 7432686 packets received on interface ens1f0, 0 dropped
1534944921.099397 Reporter::INFO 8637163 packets received on interface ens1f0, 0 dropped
/usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54 1534945591.937813 Reporter::WARNING Your interface is likely receiving invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Bro unless using the -C command-line option or toggling the 'ignore_checksums' variable. Alternatively, disable checksum offloading by the network adapter to ensure Bro analyzes the actual checksums that are transmitted.
/usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54 1534891136.995923 Reporter::WARNING Your interface is likely receiving invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Bro unless using the -C command-line option or toggling the 'ignore_checksums' variable. Alternatively, disable checksum offloading by the network adapter to ensure Bro analyzes the actual checksums that are transmitted.
5 duplicate messages suppressed
1534891114.741981 Reporter::INFO 372014 packets received on interface ens1f0, 0 dropped
1534891114.743404 Reporter::INFO 1220128 packets received on interface ens1f0, 0 dropped
1534891114.744481 Reporter::INFO 661511 packets received on interface ens1f0, 0 dropped
1534891114.745450 Reporter::INFO 576353 packets received on interface ens1f0, 0 dropped
1534891114.746610 Reporter::INFO 186013 packets received on interface ens1f0, 0 dropped
1534891114.747555 Reporter::INFO 1719908 packets received on interface ens1f0, 0 dropped
1534891114.749554 Reporter::INFO 182315 packets received on interface ens1f0, 0 dropped
1534891114.750730 Reporter::INFO 9263131 packets received on interface ens1f0, 0 dropped
1534891114.754252 Reporter::INFO 920965 packets received on interface ens1f0, 0 dropped
1534891114.740742 Reporter::INFO 184512 packets received on interface ens1f0, 0 dropped
1534891114.755328 Reporter::INFO 191456 packets received on interface ens1f0, 0 dropped
1534891114.752679 Reporter::INFO 556688 packets received on interface ens1f0, 0 dropped
1534891114.756511 Reporter::INFO 775373 packets received on interface ens1f0, 0 dropped
1534891114.739297 Reporter::INFO 898836 packets received on interface ens1f0, 0 dropped
1534891114.757522 Reporter::INFO 185947 packets received on interface ens1f0, 0 dropped
1534891114.759550 Reporter::INFO 196218 packets received on interface ens1f0, 0 dropped
1534891114.759936 Reporter::INFO 194993 packets received on interface ens1f0, 0 dropped
1534891114.762748 Reporter::INFO 834002 packets received on interface ens1f0, 0 dropped
1534891114.764149 Reporter::INFO 193840 packets received on interface ens1f0, 0 dropped
1534891114.762527 Reporter::INFO 186095 packets received on interface ens1f0, 0 dropped
1534891114.765131 Reporter::INFO 259919 packets received on interface ens1f0, 0 dropped
1534891114.748692 Reporter::INFO 1396336 packets received on interface ens1f0, 0 dropped
1534891114.751699 Reporter::INFO 179743 packets received on interface ens1f0, 0 dropped
1534891114.767549 Reporter::INFO 187725 packets received on interface ens1f0, 0 dropped
1534891114.767200 Reporter::INFO 811731 packets received on interface ens1f0, 0 dropped
1534891114.771534 Reporter::INFO 187778 packets received on interface ens1f0, 0 dropped
1534891114.769285 Reporter::INFO 207081 packets received on interface ens1f0, 0 dropped
1534891114.768756 Reporter::INFO 225592 packets received on interface ens1f0, 0 dropped
1534891114.770325 Reporter::INFO 190033 packets received on interface ens1f0, 0 dropped
1534891114.772342 Reporter::INFO 182314 packets received on interface ens1f0, 0 dropped
1534883904.208832 Reporter::INFO 80395 packets received on interface ens1f0, 213 dropped
1534883904.212230 Reporter::INFO 80261 packets received on interface ens1f0, 248 dropped
1534883904.210355 Reporter::INFO 80721 packets received on interface ens1f0, 205 dropped
1534883904.205811 Reporter::INFO 80689 packets received on interface ens1f0, 211 dropped
1534883904.202076 Reporter::INFO 79672 packets received on interface ens1f0, 317 dropped
1534883904.204229 Reporter::INFO 79679 packets received on interface ens1f0, 317 dropped
1534883904.187896 Reporter::INFO 79594 packets received on interface ens1f0, 316 dropped
1534883904.197151 Reporter::INFO 79645 packets received on interface ens1f0, 315 dropped
1534883904.193226 Reporter::INFO 79615 packets received on interface ens1f0, 315 dropped
1534883904.195856 Reporter::INFO 79643 packets received on interface ens1f0, 317 dropped
1534883904.186647 Reporter::INFO 79592 packets received on interface ens1f0, 314 dropped
1534883904.200680 Reporter::INFO 79667 packets received on interface ens1f0, 316 dropped
1534883904.192134 Reporter::INFO 79550 packets received on interface ens1f0, 311 dropped
1534883904.189401 Reporter::INFO 79534 packets received on interface ens1f0, 315 dropped
1534883904.194799 Reporter::INFO 79564 packets received on interface ens1f0, 304 dropped
1534883904.192234 Reporter::INFO 79542 packets received on interface ens1f0, 298 dropped
1534886979.721378 Reporter::INFO 1072583 packets received on interface ens1f0, 141 dropped
1534886979.732950 Reporter::INFO 1070920 packets received on interface ens1f0, 275 dropped
1534886979.724258 Reporter::INFO 1072912 packets received on interface ens1f0, 0 dropped
1534886979.729367 Reporter::INFO 1070892 packets received on interface ens1f0, 274 dropped
1534886979.734314 Reporter::INFO 1072735 packets received on interface ens1f0, 0 dropped
1534886979.725162 Reporter::INFO 1072659 packets received on interface ens1f0, 0 dropped
1534886979.727819 Reporter::INFO 1072688 packets received on interface ens1f0, 0 dropped
1534886979.730567 Reporter::INFO 1072646 packets received on interface ens1f0, 127 dropped
1534886979.722836 Reporter::INFO 1072663 packets received on interface ens1f0, 0 dropped
1534886979.731947 Reporter::INFO 1072706 packets received on interface ens1f0, 0 dropped
1534886979.736491 Reporter::INFO 1072677 packets received on interface ens1f0, 141 dropped
1534886979.735473 Reporter::INFO 1070923 packets received on interface ens1f0, 283 dropped
1534886979.740571 Reporter::INFO 1070961 packets received on interface ens1f0, 1645 dropped
1534886979.738937 Reporter::INFO 1070963 packets received on interface ens1f0, 1634 dropped
1534886979.745314 Reporter::INFO 1070974 packets received on interface ens1f0, 1658 dropped
1534886979.746646 Reporter::INFO 1070971 packets received on interface ens1f0, 1645 dropped
1534886979.741628 Reporter::INFO 1070914 packets received on interface ens1f0, 1631 dropped
1534886979.737836 Reporter::INFO 1070932 packets received on interface ens1f0, 1644 dropped
1534886979.747760 Reporter::INFO 1070998 packets received on interface ens1f0, 1645 dropped
1534886979.743048 Reporter::INFO 1070961 packets received on interface ens1f0, 1646 dropped
1534886979.752948 Reporter::INFO 1071049 packets received on interface ens1f0, 1645 dropped
1534886979.756786 Reporter::INFO 1071055 packets received on interface ens1f0, 1645 dropped
1534886979.755328 Reporter::INFO 1071055 packets received on interface ens1f0, 1645 dropped
1534886979.748970 Reporter::INFO 1071019 packets received on interface ens1f0, 1645 dropped
1534886979.751311 Reporter::INFO 1071005 packets received on interface ens1f0, 1645 dropped
1534886979.744325 Reporter::INFO 1070921 packets received on interface ens1f0, 1636 dropped
1534886979.758477 Reporter::INFO 1071064 packets received on interface ens1f0, 1656 dropped
1534886979.750125 Reporter::INFO 1071021 packets received on interface ens1f0, 1645 dropped
1534886979.761657 Reporter::INFO 1071083 packets received on interface ens1f0, 1645 dropped
1534886979.759686 Reporter::INFO 1071089 packets received on interface ens1f0, 1645 dropped
1534885764.945067 Reporter::INFO 19126359 packets received on interface ens1f0, 0 dropped
1534885764.945531 Reporter::INFO 19126359 packets received on interface ens1f0, 0 dropped
1534885764.946470 Reporter::INFO 19126368 packets received on interface ens1f0, 0 dropped
1534885764.951497 Reporter::INFO 19126406 packets received on interface ens1f0, 0 dropped
1534885764.957197 Reporter::INFO 19126436 packets received on interface ens1f0, 0 dropped
1534885764.940357 Reporter::INFO 19126342 packets received on interface ens1f0, 0 dropped
1534885764.949431 Reporter::INFO 19126391 packets received on interface ens1f0, 0 dropped
1534885764.942220 Reporter::INFO 19126355 packets received on interface ens1f0, 0 dropped
1534885764.952726 Reporter::INFO 19126408 packets received on interface ens1f0, 0 dropped
1534885764.959421 Reporter::INFO 19126450 packets received on interface ens1f0, 0 dropped
1534885764.954249 Reporter::INFO 19126417 packets received on interface ens1f0, 0 dropped
1534885764.950288 Reporter::INFO 19126397 packets received on interface ens1f0, 0 dropped
1534885764.955658 Reporter::INFO 19126430 packets received on interface ens1f0, 0 dropped
1534885764.957783 Reporter::INFO 19126441 packets received on interface ens1f0, 0 dropped
1534885764.963722 Reporter::INFO 19126483 packets received on interface ens1f0, 0 dropped
1534885764.962441 Reporter::INFO 19126469 packets received on interface ens1f0, 0 dropped
1534896852.839809 Reporter::INFO 1030981 packets received on interface ens1f0, 0 dropped
1534896852.838278 Reporter::INFO 392308 packets received on interface ens1f0, 0 dropped
1534896852.841137 Reporter::INFO 818619 packets received on interface ens1f0, 0 dropped
1534896852.844479 Reporter::INFO 1067255 packets received on interface ens1f0, 0 dropped
1534896852.842345 Reporter::INFO 1029386 packets received on interface ens1f0, 0 dropped
1534896852.843326 Reporter::INFO 841578 packets received on interface ens1f0, 0 dropped
1534896852.848442 Reporter::INFO 1045593 packets received on interface ens1f0, 0 dropped
1534896852.845493 Reporter::INFO 551282 packets received on interface ens1f0, 0 dropped
1534896852.846489 Reporter::INFO 2077035 packets received on interface ens1f0, 0 dropped
1534896852.847490 Reporter::INFO 1160096 packets received on interface ens1f0, 0 dropped
1534896852.849339 Reporter::INFO 995455 packets received on interface ens1f0, 0 dropped
1534896852.853370 Reporter::INFO 1027908 packets received on interface ens1f0, 0 dropped

====================================================================

#################################################################
# Checking if many recent connections have a SAD or had history #
#################################################################
error: 75.53%, 14289 out of 18918 connections are half duplex


################################################
# Checking for recent capture_loss.log entries #
################################################
Capture loss stats:
worker=bro loss_count=0 noloss_count=168 min_loss=0.0 max_loss=0.0 overall_loss=0.0
error: worker=worker-1-1 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=90.58044 overall_loss=30.0099164747
error: worker=worker-1-10 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=66.577529 overall_loss=26.2442908547
error: worker=worker-1-11 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=67.87389 overall_loss=26.7675967606
error: worker=worker-1-12 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=77.820137 overall_loss=27.4078064757
error: worker=worker-1-13 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=67.574163 overall_loss=27.6988371111
error: worker=worker-1-14 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=63.196486 overall_loss=27.1931529853
error: worker=worker-1-15 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=69.743213 overall_loss=32.3055577642
error: worker=worker-1-16 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=65.566384 overall_loss=26.8274826352
error: worker=worker-1-17 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=66.476362 overall_loss=26.5423912274
error: worker=worker-1-18 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=64.410006 overall_loss=24.3477958539
error: worker=worker-1-19 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=63.185604 overall_loss=27.4475582221
error: worker=worker-1-2 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=69.678688 overall_loss=28.7640239525
error: worker=worker-1-20 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=63.994922 overall_loss=23.2474631419
error: worker=worker-1-21 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=65.539711 overall_loss=27.260502257
error: worker=worker-1-22 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=69.499069 overall_loss=30.2300032238
error: worker=worker-1-23 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=64.149083 overall_loss=31.1798858888
error: worker=worker-1-24 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=66.324143 overall_loss=22.5580314778
error: worker=worker-1-25 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=69.687833 overall_loss=29.667065343
error: worker=worker-1-26 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=65.863473 overall_loss=31.1213730204
error: worker=worker-1-27 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=65.497259 overall_loss=26.6064033512
error: worker=worker-1-28 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=64.144564 overall_loss=30.1043332527
error: worker=worker-1-29 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=64.903986 overall_loss=29.6894566048
error: worker=worker-1-3 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=68.130508 overall_loss=25.3573669197
error: worker=worker-1-30 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=72.621044 overall_loss=24.3457544317
worker=worker-1-31 loss_count=0 noloss_count=1 min_loss=0.0 max_loss=0.0 overall_loss=0.0
worker=worker-1-32 loss_count=0 noloss_count=1 min_loss=0.0 max_loss=0.0 overall_loss=0.0
worker=worker-1-33 loss_count=0 noloss_count=1 min_loss=0.0 max_loss=0.0 overall_loss=0.0
worker=worker-1-34 loss_count=0 noloss_count=1 min_loss=0.0 max_loss=0.0 overall_loss=0.0
worker=worker-1-35 loss_count=0 noloss_count=1 min_loss=0.0 max_loss=0.0 overall_loss=0.0
worker=worker-1-36 loss_count=0 noloss_count=1 min_loss=0.0 max_loss=0.0 overall_loss=0.0
error: worker=worker-1-4 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=67.538442 overall_loss=29.82586108
error: worker=worker-1-5 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=71.771296 overall_loss=31.4585270245
error: worker=worker-1-6 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=71.430447 overall_loss=27.4405648085
error: worker=worker-1-7 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=66.640978 overall_loss=25.8919190418
error: worker=worker-1-8 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=81.943911 overall_loss=34.0497839132
error: worker=worker-1-9 loss_count=69 noloss_count=1 min_loss=0.0 max_loss=72.243329 overall_loss=31.5214084629


################################################################
# Checking what percentage of recent tcp connections show loss #
################################################################
error: 13.21%, 2499 out of 18918 connections have capture loss


###################################################################
# Checking if connections are unevenly distributed across workers #
###################################################################
error: No node names in conn log. Install add-node-names package to add a corresponding field.


###############################################################################################################################
# Checking if anything is in the deprecated local-logger.bro, local-manager.bro, local-proxy.bro, or local-worker.bro scripts #
###############################################################################################################################
Nothing found


######################################################################
# Checking if any recent connections have been logged multiple times #
######################################################################
ok, only 0.00%, 0 out of 694 connections appear to be duplicate


############################################################################
# Checking what percentage of recent tcp connections are remote to remote. #
############################################################################
error: 52.85%, 52853 out of 100000 connections are remote to remote


###############################################################################
# Checking if bro is linked against a custom malloc like tcmalloc or jemalloc #
###############################################################################
error: configured to use a custom malloc=False


##################################
# Checking pf_ring configuration #
##################################
configured to use pf_ring=False pcap=False plugin=False


############################################
# Checking for recent reporter.log entries #
############################################
error: Found 14 reporter log files in the past 7 days
Recent reporter.log messages:
/usr/local/bro/share/bro/policy/protocols/ssh/geo-data.bro, line 30 1534951863.435398 Reporter::ERROR Bro was not configured for GeoIP support (lookup_location(SSH::lookup_ip))
/usr/local/bro/share/bro/policy/protocols/ssh/geo-data.bro, line 30 1534945001.027149 Reporter::ERROR Bro was not configured for GeoIP support (lookup_location(SSH::lookup_ip))
84 duplicate messages suppressed
1534944972.753733 Reporter::INFO processing suspended
1534944972.748734 Reporter::INFO processing suspended
133 duplicate messages suppressed
1534944972.753733 Reporter::INFO processing continued
1534944972.748734 Reporter::INFO processing continued
154 duplicate messages suppressed
/usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54 1534944982.705403 Reporter::WARNING Your interface is likely receiving invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Bro unless using the -C command-line option or toggling the 'ignore_checksums' variable. Alternatively, disable checksum offloading by the network adapter to ensure Bro analyzes the actual checksums that are transmitted.
/usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54 1534891137.279399 Reporter::WARNING Your interface is likely receiving invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Bro unless using the -C command-line option or toggling the 'ignore_checksums' variable. Alternatively, disable checksum offloading by the network adapter to ensure Bro analyzes the actual checksums that are transmitted.
49 duplicate messages suppressed
1534944917.319061 Reporter::INFO received termination signal
1534944917.325143 Reporter::INFO received termination signal
62 duplicate messages suppressed
1534944917.319061 Reporter::INFO 59439914 packets received on interface ens1f0, 0 dropped
1534944917.325143 Reporter::INFO 45022136 packets received on interface ens1f0, 0 dropped
1534944917.321217 Reporter::INFO 51627314 packets received on interface ens1f0, 0 dropped
1534944917.322883 Reporter::INFO 53891270 packets received on interface ens1f0, 0 dropped
1534944917.324085 Reporter::INFO 88417370 packets received on interface ens1f0, 0 dropped
1534944917.319957 Reporter::INFO 54791897 packets received on interface ens1f0, 0 dropped
1534944917.327305 Reporter::INFO 43678446 packets received on interface ens1f0, 0 dropped
1534944917.330087 Reporter::INFO 59703116 packets received on interface ens1f0, 0 dropped
1534944917.328208 Reporter::INFO 45211345 packets received on interface ens1f0, 0 dropped
1534944917.331134 Reporter::INFO 65262767 packets received on interface ens1f0, 0 dropped
1534944917.334825 Reporter::INFO 55188270 packets received on interface ens1f0, 0 dropped
1534944917.329257 Reporter::INFO 52034434 packets received on interface ens1f0, 0 dropped
1534944917.332107 Reporter::INFO 49495024 packets received on interface ens1f0, 0 dropped
1534944917.333015 Reporter::INFO 49204592 packets received on interface ens1f0, 0 dropped
1534944917.336672 Reporter::INFO 50396962 packets received on interface ens1f0, 0 dropped
1534944917.333880 Reporter::INFO 54559564 packets received on interface ens1f0, 0 dropped
1534944917.338831 Reporter::INFO 55635174 packets received on interface ens1f0, 0 dropped
1534944917.341397 Reporter::INFO 58206831 packets received on interface ens1f0, 0 dropped
1534944917.344880 Reporter::INFO 56679728 packets received on interface ens1f0, 0 dropped
1534944917.343923 Reporter::INFO 50649583 packets received on interface ens1f0, 0 dropped
1534944917.345953 Reporter::INFO 69599231 packets received on interface ens1f0, 0 dropped
1534944917.340038 Reporter::INFO 52187914 packets received on interface ens1f0, 0 dropped
1534944917.342584 Reporter::INFO 54560034 packets received on interface ens1f0, 0 dropped
1534944917.346950 Reporter::INFO 43540603 packets received on interface ens1f0, 0 dropped
1534944917.351083 Reporter::INFO 94170430 packets received on interface ens1f0, 0 dropped
1534944917.326276 Reporter::INFO 61679606 packets received on interface ens1f0, 0 dropped
1534944917.348123 Reporter::INFO 52021287 packets received on interface ens1f0, 0 dropped
1534944917.350083 Reporter::INFO 59524072 packets received on interface ens1f0, 0 dropped
1534944917.349146 Reporter::INFO 53404009 packets received on interface ens1f0, 0 dropped
1534944917.337775 Reporter::INFO 53757378 packets received on interface ens1f0, 0 dropped
1534891115.907765 Reporter::INFO 5219842 packets received on interface ens1f0, 0 dropped
1534891115.910010 Reporter::INFO 8455238 packets received on interface ens1f0, 0 dropped
1534891115.912626 Reporter::INFO 5228397 packets received on interface ens1f0, 0 dropped
1534891115.908858 Reporter::INFO 5747798 packets received on interface ens1f0, 0 dropped
1534891115.911476 Reporter::INFO 5697178 packets received on interface ens1f0, 0 dropped
1534891115.915011 Reporter::INFO 4743597 packets received on interface ens1f0, 0 dropped
1534891115.918402 Reporter::INFO 6230492 packets received on interface ens1f0, 0 dropped
1534891115.913823 Reporter::INFO 7316772 packets received on interface ens1f0, 0 dropped
1534891115.920544 Reporter::INFO 8210898 packets received on interface ens1f0, 0 dropped
1534891115.919012 Reporter::INFO 5552873 packets received on interface ens1f0, 0 dropped
1534891115.922581 Reporter::INFO 4081197 packets received on interface ens1f0, 0 dropped
1534891115.925757 Reporter::INFO 5280114 packets received on interface ens1f0, 0 dropped
1534891115.923134 Reporter::INFO 5538765 packets received on interface ens1f0, 0 dropped
1534891115.927881 Reporter::INFO 5320840 packets received on interface ens1f0, 0 dropped
1534891115.934273 Reporter::INFO 5085653 packets received on interface ens1f0, 0 dropped
1534891115.928977 Reporter::INFO 9046510 packets received on interface ens1f0, 0 dropped
1534891115.921576 Reporter::INFO 5739363 packets received on interface ens1f0, 0 dropped
1534891115.924612 Reporter::INFO 6341512 packets received on interface ens1f0, 0 dropped
1534891115.930012 Reporter::INFO 7133185 packets received on interface ens1f0, 0 dropped
1534891115.926888 Reporter::INFO 7315668 packets received on interface ens1f0, 0 dropped
1534891115.941661 Reporter::INFO 5996086 packets received on interface ens1f0, 0 dropped
1534891115.935347 Reporter::INFO 7124354 packets received on interface ens1f0, 0 dropped
1534891115.932771 Reporter::INFO 4812550 packets received on interface ens1f0, 0 dropped
1534891115.938545 Reporter::INFO 6444252 packets received on interface ens1f0, 0 dropped
1534891115.936348 Reporter::INFO 5005644 packets received on interface ens1f0, 0 dropped
1534891115.916074 Reporter::INFO 5041262 packets received on interface ens1f0, 0 dropped
1534891115.939653 Reporter::INFO 5810241 packets received on interface ens1f0, 0 dropped
1534891115.940698 Reporter::INFO 5284307 packets received on interface ens1f0, 0 dropped
1534891115.917379 Reporter::INFO 5701186 packets received on interface ens1f0, 0 dropped
1534891115.937405 Reporter::INFO 4785317 packets received on interface ens1f0, 0 dropped
1531828134.620284 Reporter::INFO 424506 packets received on interface eno1, 0 dropped





-----Original Message-----
From: Azoff, Justin S <***@illinois.edu>
Sent: Wednesday, August 22, 2018 10:59 AM
To: Ron McClellan <***@ao.uscourts.gov>
Cc: ***@bro.org
Subject: Re: [Bro] BRO Logger crashing due to large DNS log files
Post by Ron McClellan
Justin,
Got good news and solid progress with your help. BRO is running on both boxes and hasn't crashed since 10pm last night. If I'm reading the NUMA data from my systems correctly, I don't really need to split the load between two workers as you did, right?
If you can get another NIC so each box has 2, then you could divide the workers between each NIC and NUMA node. Otherwise it doesn't matter so much.
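For reference, a minimal sketch of what that two-NIC split could look like in node.cfg. The second interface name, host, and core numbers below are placeholders; check the real NUMA layout with lscpu or numactl --hardware, and keep whatever lb_method/lb_procs settings you are already using (they're omitted here):

[worker-1]
type=worker
host=localhost
interface=ens1f0
# cores on the same NUMA node as ens1f0
pin_cpus=2,3,4,5,6,7

[worker-2]
type=worker
host=localhost
interface=ens1f1
# cores on the same NUMA node as ens1f1
pin_cpus=34,35,36,37,38,39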
Post by Ron McClellan
I'm working on some tuning now and also trying to address the really high lag (500) that I'm still seeing. Currently seeing some loss on it, but I'll continue to tune and see if I can get that under control. Let me know if you need help testing the doctor script.
Ron
1534948572.682908 900.000005 worker-1-8 60904 532647 11.434214
1534948572.692674 900.000072 worker-1-13 67152 216975 30.949188
1534948572.688750 900.000028 worker-1-18 70383 235710 29.859997
1534948572.705484 900.000037 worker-1-24 57008 201189 28.335545
1534948572.682147 900.000099 worker-1-5 61878 194825 31.760811
1534948572.699536 900.000061 worker-1-16 76385 256671 29.759887
1534948572.682829 900.000080 worker-1-29 52464 188150 27.884135
1534948572.683536 900.000049 worker-1-4 110222 314119 35.08925
worker-1-1: 1534949053.166850 recvd=813997 dropped=0 link=813997
worker-1-2: 1534949053.366803 recvd=873351 dropped=0 link=873353
worker-1-3: 1534949053.567778 recvd=1770808 dropped=0 link=1770810
worker-1-4: 1534949053.767852 recvd=865443 dropped=0 link=865449
worker-1-5: 1534949053.968873 recvd=349355 dropped=0 link=349361
worker-1-6: 1534949054.168785 recvd=1152160 dropped=0 link=1152161
worker-1-7: 1534949054.368825 recvd=1358553 dropped=0 link=1358553
worker-1-8: 1534949054.569808 recvd=345267 dropped=0 link=345272
worker-1-9: 1534949054.769982 recvd=856725 dropped=0 link=856732
worker-1-10: 1534949054.969811 recvd=351148 dropped=0 link=351148
worker-1-11: 1534949055.170855 recvd=883897 dropped=0 link=883897
worker-1-12: 1534949055.370950 recvd=820117 dropped=0 link=820125
worker-1-13: 1534949055.571899 recvd=1132465 dropped=0 link=1132473
worker-1-14: 1534949055.771751 recvd=823249 dropped=0 link=823249
worker-1-15: 1534949055.972921 recvd=754342 dropped=0 link=754343
worker-1-16: 1534949056.173778 recvd=822102 dropped=0 link=822106
worker-1-17: 1534949056.373806 recvd=570905 dropped=0 link=570911
worker-1-18: 1534949056.573815 recvd=1033845 dropped=0 link=1033846
worker-1-19: 1534949056.774737 recvd=648977 dropped=0 link=649001
worker-1-20: 1534949056.974823 recvd=816836 dropped=0 link=816838
worker-1-21: 1534949057.175858 recvd=423896 dropped=0 link=423901
worker-1-22: 1534949057.375894 recvd=761794 dropped=0 link=761796
worker-1-23: 1534949057.576737 recvd=415151 dropped=0 link=415153
worker-1-24: 1534949057.776887 recvd=604342 dropped=0 link=604349
worker-1-25: 1534949057.978046 recvd=911772 dropped=0 link=911785
worker-1-26: 1534949058.177749 recvd=358386 dropped=0 link=358395
worker-1-27: 1534949058.379062 recvd=1283463 dropped=0 link=1283465
worker-1-28: 1534949058.578751 recvd=364801 dropped=0 link=364807
worker-1-29: 1534949058.778735 recvd=930041 dropped=0 link=930042
worker-1-30: 1534949058.979938 recvd=857963 dropped=0 link=857967
If you're seeing a high percentage of capture loss but netstats is showing 0 dropped packets, that means one of two things:

* Something still isn't right with the load balancing. It could be that your NIC isn't doing symmetric hashing properly.
* There's an issue with the traffic upstream of bro.

A bunch of the checks that bro-doctor does can help diagnose this, but you'd need to re-enable the conn.log
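To rule out the NIC side of the first possibility, you can peek at how the card hashes flows with ethtool. A sketch only: option support and output format vary by driver, and ens1f0 is simply the interface name used elsewhere in this thread.

ethtool -n ens1f0 rx-flow-hash tcp4   # which header fields feed the RSS hash for TCP over IPv4
ethtool -x ens1f0                     # RSS indirection table and hash key

With many drivers the default Toeplitz key is not symmetric, so the two directions of a flow can land on different queues (and therefore different workers) unless the driver or a symmetric hash key arranges otherwise.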


Justin Azoff
Azoff, Justin S
2018-08-22 17:00:36 UTC
Permalink
Post by Ron McClellan
Sorry, forgot to send that, I did re-enable the conn.log.
Ron
A few things look like they are still not right.
Post by Ron McClellan
/usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54 1534945591.937813 Reporter::WARNING Your interface is likely receiving invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Bro unless using the -C command-line option or toggling the 'ignore_checksums' variable. Alternatively, disable checksum offloading by the network adapter to ensure Bro analyzes the actual checksums that are transmitted.
/usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54 1534891136.995923 Reporter::WARNING Your interface is likely receiving invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Bro unless using the -C command-line option or toggling the 'ignore_checksums' variable. Alternatively, disable checksum offloading by the network adapter to ensure Bro analyzes the actual checksums that are transmitted.
5 duplicate messages suppressed
Ensuring all the offloading is set up correctly on your interfaces may fix a few of these. I have another plugin for broctl that can do this automatically for you:

https://packages.bro.org/packages/view/7520ca9d-4fb7-11e8-88be-0a645a3f3086

you just need to install it and then add

interfacesetup.enabled=1

to your broctl.cfg

The next time bro restarts it will ensure a bunch of settings are set properly using ethtool.
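If you want to apply or verify the same settings by hand, the plugin's default flags command (it shows up later in this thread as a commented-out default in broctl.cfg) is essentially the ethtool invocation below; substitute your capture interface for ens1f0:

ethtool -K ens1f0 gro off lro off rx off tx off gso off   # turn off GRO/LRO/GSO and rx/tx checksum offload
ethtool -k ens1f0 | egrep 'offload|checksum'              # confirm the settings took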
Post by Ron McClellan
#################################################################
# Checking if many recent connections have a SAD or had history #
#################################################################
error: 24.30%, 52 out of 214 connections are half duplex
#################################################################
# Checking if many recent connections have a SAD or had history #
#################################################################
error: 75.53%, 14289 out of 18918 connections are half duplex
These 2 show that things are not working well at all. The rest of the checks don't mean much until this is fixed, so this is the thing to focus on.

This could be caused by an internal load balancing problem, or by an upstream issue. It's pretty easy to figure out which; I just haven't worked out the best way to have bro-doctor automate it.

What you want to do is run this script from a host whose request and response bro will see:

# hit www.bro.org from 19 known source ports (30001-30019), one connection per second
for x in $(seq -w 1 19); do
    echo -e 'GET / HTTP/1.1\r\nHost: www.bro.org\r\n\r\n' |
        socat - tcp-connect:www.bro.org:80,sp=300$x,reuseaddr;
    sleep 1;
done

Wait a few minutes (if packets are being dropped, the log entries may only show up after the connections time out, which can take around 5 minutes) and then run

cat conn.log |bro-cut -d ts id.orig_p id.resp_h id.resp_p history orig_pkts resp_pkts|fgrep 192.150.187.43|fgrep 300

You should see something like this:

2018-08-22T11:53:46-0500 30001 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:47-0500 30002 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:48-0500 30003 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:49-0500 30004 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:50-0500 30005 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:52-0500 30006 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:53-0500 30007 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:54-0500 30008 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:55-0500 30009 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:56-0500 30010 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:57-0500 30011 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:53:59-0500 30012 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:54:00-0500 30013 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:54:01-0500 30014 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:54:02-0500 30015 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:54:03-0500 30016 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:54:04-0500 30017 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:54:06-0500 30018 192.150.187.43 80 ShADFadf 6 4
2018-08-22T11:54:07-0500 30019 192.150.187.43 80 ShADFadf 6 4

Duplicate, missing, or split entries each point to a different problem: duplicates suggest the same packets are reaching bro more than once (e.g. overlapping tap feeds), missing entries suggest packets are being dropped before bro sees them, and split entries (two half-duplex records for one test connection) suggest the two directions are being hashed to different workers. A full history like ShADFadf, with both upper- and lower-case letters, means both directions were seen.
Post by Ron McClellan
############################################################################
# Checking what percentage of recent tcp connections are remote to remote. #
############################################################################
error: 52.85%, 52853 out of 100000 connections are remote to remote
In your networks.cfg make sure you have listed all of your 'local' address space.
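For reference, networks.cfg is just one prefix per line with an optional description. The ranges below are placeholders; list every block, private and public, that counts as "local" on your side:

10.0.0.0/8          Internal RFC1918
172.16.0.0/12       Internal RFC1918
192.168.0.0/16      Internal RFC1918
198.51.100.0/24     Example public range that belongs to us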


Justin Azoff
Ron McClellan
2018-08-23 01:00:36 UTC
Permalink
Justin,

I disabled checksum checking, but still get the same error in the doctor output. I'm not sure if the interfacesetup config is working; I'm using the defaults for now. I'm going to get with our network guy tomorrow and review the tap configurations to make sure there are no issues there.

Thanks Again,

Ron



## Global BroControl configuration file.

interfacesetup.enabled=1
#To change the default mtu that is configured
#interfacesetup.mtu=9000

#To change the default commands that are used to bring up the interface
#interfacesetup.up_command=/sbin/ifconfig {interface} up mtu {mtu}
#interfacesetup.flags_command=/sbin/ethtool -K {interface} gro off lro off rx off tx off gso off



## If true, don't verify checksums. Useful for running on altered trace
## files, and for saving a few cycles, but at the risk of analyzing invalid
## data. Note that the ``-C`` command-line option overrides the setting of this
## variable.
const ignore_checksums = T &redef;
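# Note: if the line above is going into local.bro (rather than being edited in
# place in the shipped scripts), the usual idiom is a redef instead of a second
# const declaration, e.g.:
#     redef ignore_checksums = T;
# otherwise bro may complain that ignore_checksums is already defined.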



#################################################################
# Checking if many recent connections have a SAD or had history #
#################################################################
error: 74.24%, 18466 out of 24872 connections are half duplex


################################################
# Checking for recent capture_loss.log entries #
################################################
Capture loss stats:
error: worker=worker-1-1 loss_count=103 noloss_count=0 min_loss=0.794921 max_loss=90.58044 overall_loss=27.2273066532
error: worker=worker-1-10 loss_count=102 noloss_count=0 min_loss=0.448354 max_loss=66.577529 overall_loss=23.5837489489
error: worker=worker-1-11 loss_count=102 noloss_count=0 min_loss=0.720309 max_loss=67.87389 overall_loss=26.0957227921
error: worker=worker-1-12 loss_count=102 noloss_count=0 min_loss=0.130618 max_loss=77.820137 overall_loss=24.5815538385
error: worker=worker-1-13 loss_count=102 noloss_count=0 min_loss=0.361211 max_loss=67.574163 overall_loss=24.8552395812
error: worker=worker-1-14 loss_count=102 noloss_count=0 min_loss=0.368082 max_loss=63.196486 overall_loss=27.4975008733
error: worker=worker-1-15 loss_count=102 noloss_count=0 min_loss=0.191227 max_loss=69.743213 overall_loss=26.8516883949
error: worker=worker-1-16 loss_count=102 noloss_count=0 min_loss=0.148612 max_loss=65.566384 overall_loss=25.6185453556
error: worker=worker-1-17 loss_count=102 noloss_count=0 min_loss=0.327409 max_loss=66.476362 overall_loss=26.3990676615
error: worker=worker-1-18 loss_count=102 noloss_count=0 min_loss=0.734925 max_loss=64.410006 overall_loss=24.4351738027
error: worker=worker-1-19 loss_count=102 noloss_count=0 min_loss=0.205711 max_loss=63.185604 overall_loss=25.5207097775
error: worker=worker-1-2 loss_count=102 noloss_count=0 min_loss=0.62053 max_loss=69.678688 overall_loss=27.4164344389
error: worker=worker-1-20 loss_count=102 noloss_count=0 min_loss=0.28481 max_loss=63.994922 overall_loss=22.7225619728
error: worker=worker-1-21 loss_count=102 noloss_count=0 min_loss=0.376553 max_loss=65.539711 overall_loss=24.2161148411
error: worker=worker-1-22 loss_count=103 noloss_count=0 min_loss=0.71613 max_loss=69.499069 overall_loss=29.2418154822
error: worker=worker-1-23 loss_count=102 noloss_count=0 min_loss=0.397382 max_loss=64.149083 overall_loss=28.2197289348
error: worker=worker-1-24 loss_count=102 noloss_count=0 min_loss=0.681759 max_loss=66.324143 overall_loss=21.3073056893
error: worker=worker-1-25 loss_count=102 noloss_count=0 min_loss=0.33738 max_loss=69.687833 overall_loss=26.2137549719
error: worker=worker-1-26 loss_count=102 noloss_count=0 min_loss=0.387409 max_loss=65.863473 overall_loss=26.7663244831
error: worker=worker-1-27 loss_count=102 noloss_count=0 min_loss=0.574677 max_loss=65.497259 overall_loss=26.6219106573
error: worker=worker-1-28 loss_count=102 noloss_count=0 min_loss=0.146595 max_loss=64.144564 overall_loss=27.6602933778
error: worker=worker-1-29 loss_count=102 noloss_count=0 min_loss=0.22543 max_loss=64.903986 overall_loss=27.0578623844
error: worker=worker-1-3 loss_count=102 noloss_count=0 min_loss=0.554694 max_loss=68.130508 overall_loss=25.1314879388
error: worker=worker-1-30 loss_count=102 noloss_count=0 min_loss=0.523326 max_loss=72.621044 overall_loss=23.007387178
error: worker=worker-1-4 loss_count=102 noloss_count=0 min_loss=0.604213 max_loss=67.538442 overall_loss=27.8744586007
error: worker=worker-1-5 loss_count=102 noloss_count=0 min_loss=0.088526 max_loss=71.771296 overall_loss=26.4599103542
error: worker=worker-1-6 loss_count=102 noloss_count=0 min_loss=0.196425 max_loss=71.430447 overall_loss=23.4866215325
error: worker=worker-1-7 loss_count=102 noloss_count=0 min_loss=0.72676 max_loss=66.640978 overall_loss=26.7013322956
error: worker=worker-1-8 loss_count=102 noloss_count=0 min_loss=0.222815 max_loss=81.943911 overall_loss=27.3586324115
error: worker=worker-1-9 loss_count=103 noloss_count=0 min_loss=0.861021 max_loss=72.243329 overall_loss=29.50492059


################################################################
# Checking what percentage of recent tcp connections show loss #
################################################################
0.04%, 10 out of 24872 connections have capture loss


###################################################################
# Checking if connections are unevenly distributed across workers #
###################################################################
error: No node names in conn log. Install add-node-names package to add a corresponding field.


###############################################################################################################################
# Checking if anything is in the deprecated local-logger.bro, local-manager.bro, local-proxy.bro, or local-worker.bro scripts #
###############################################################################################################################
Nothing found


######################################################################
# Checking if any recent connections have been logged multiple times #
######################################################################
ok, only 0.00%, 0 out of 683 connections appear to be duplicate


############################################################################
# Checking what percentage of recent tcp connections are remote to remote. #
############################################################################
0.01%, 11 out of 100000 connections are remote to remote


###############################################################################
# Checking if bro is linked against a custom malloc like tcmalloc or jemalloc #
###############################################################################
error: configured to use a custom malloc=False


##################################
# Checking pf_ring configuration #
##################################
configured to use pf_ring=False pcap=False plugin=False


############################################
# Checking for recent reporter.log entries #
############################################
error: Found 22 reporter log files in the past 7 days
Recent reporter.log messages:
Reporter::INFO processing suspended
Reporter::INFO processing suspended
221 duplicate messages suppressed
Reporter::INFO processing continued
Reporter::INFO processing continued
268 duplicate messages suppressed
/usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54 1534944982.705403 Reporter::WARNING Your interface is likely receiving invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Bro unless using the -C command-line option or toggling the 'ignore_checksums' variable. Alternatively, disable checksum offloading by the network adapter to ensure Bro analyzes the actual checksums that are transmitted.
/usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54 1534971018.604887 Reporter::WARNING Your interface is likely receiving invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Bro unless using the -C command-line option or toggling the 'ignore_checksums' variable. Alternatively, disable checksum offloading by the network adapter to ensure Bro analyzes the actual checksums that are transmitted.
21 duplicate messages suppressed
/usr/local/bro/share/bro/policy/protocols/ssh/geo-data.bro, line 30 1534945001.027149 Reporter::ERROR Bro was not configured for GeoIP support (lookup_location(SSH::lookup_ip))
/usr/local/bro/share/bro/policy/protocols/ssh/geo-data.bro, line 30 1534945023.085730 Reporter::ERROR Bro was not configured for GeoIP support (lookup_location(SSH::lookup_ip))
95 duplicate messages suppressed
1534944917.319061 Reporter::INFO received termination signal
1534944917.325143 Reporter::INFO received termination signal
196 duplicate messages suppressed
1534944917.319061 Reporter::INFO 59439914 packets received on interface ens1f0, 0 dropped
1534944917.325143 Reporter::INFO 45022136 packets received on interface ens1f0, 0 dropped
1534944917.321217 Reporter::INFO 51627314 packets received on interface ens1f0, 0 dropped
1534944917.322883 Reporter::INFO 53891270 packets received on interface ens1f0, 0 dropped
1534944917.324085 Reporter::INFO 88417370 packets received on interface ens1f0, 0 dropped
1534944917.319957 Reporter::INFO 54791897 packets received on interface ens1f0, 0 dropped
1534944917.327305 Reporter::INFO 43678446 packets received on interface ens1f0, 0 dropped
1534944917.330087 Reporter::INFO 59703116 packets received on interface ens1f0, 0 dropped
1534944917.328208 Reporter::INFO 45211345 packets received on interface ens1f0, 0 dropped
1534944917.331134 Reporter::INFO 65262767 packets received on interface ens1f0, 0 dropped
1534944917.334825 Reporter::INFO 55188270 packets received on interface ens1f0, 0 dropped
1534944917.329257 Reporter::INFO 52034434 packets received on interface ens1f0, 0 dropped
1534944917.332107 Reporter::INFO 49495024 packets received on interface ens1f0, 0 dropped
1534944917.333015 Reporter::INFO 49204592 packets received on interface ens1f0, 0 dropped
1534944917.336672 Reporter::INFO 50396962 packets received on interface ens1f0, 0 dropped
1534944917.333880 Reporter::INFO 54559564 packets received on interface ens1f0, 0 dropped
1534944917.338831 Reporter::INFO 55635174 packets received on interface ens1f0, 0 dropped
1534944917.341397 Reporter::INFO 58206831 packets received on interface ens1f0, 0 dropped
1534944917.344880 Reporter::INFO 56679728 packets received on interface ens1f0, 0 dropped
1534944917.343923 Reporter::INFO 50649583 packets received on interface ens1f0, 0 dropped
1534944917.345953 Reporter::INFO 69599231 packets received on interface ens1f0, 0 dropped
1534944917.340038 Reporter::INFO 52187914 packets received on interface ens1f0, 0 dropped
1534944917.342584 Reporter::INFO 54560034 packets received on interface ens1f0, 0 dropped
1534944917.346950 Reporter::INFO 43540603 packets received on interface ens1f0, 0 dropped
1534944917.351083 Reporter::INFO 94170430 packets received on interface ens1f0, 0 dropped
1534944917.326276 Reporter::INFO 61679606 packets received on interface ens1f0, 0 dropped
1534944917.348123 Reporter::INFO 52021287 packets received on interface ens1f0, 0 dropped
1534944917.350083 Reporter::INFO 59524072 packets received on interface ens1f0, 0 dropped
1534944917.349146 Reporter::INFO 53404009 packets received on interface ens1f0, 0 dropped
1534944917.337775 Reporter::INFO 53757378 packets received on interface ens1f0, 0 dropped
1534981838.348667 Reporter::INFO 800278 packets received on interface ens1f0, 0 dropped
1534981838.349759 Reporter::INFO 1300570 packets received on interface ens1f0, 0 dropped
1534981838.352160 Reporter::INFO 1038523 packets received on interface ens1f0, 0 dropped
1534981838.353313 Reporter::INFO 602996 packets received on interface ens1f0, 0 dropped
1534981838.357869 Reporter::INFO 595341 packets received on interface ens1f0, 0 dropped
1534981838.358662 Reporter::INFO 630136 packets received on interface ens1f0, 0 dropped
1534981838.350883 Reporter::INFO 558995 packets received on interface ens1f0, 0 dropped
1534981838.360612 Reporter::INFO 647854 packets received on interface ens1f0, 0 dropped
1534981838.355404 Reporter::INFO 742801 packets received on interface ens1f0, 0 dropped
1534981838.354432 Reporter::INFO 1158902 packets received on interface ens1f0, 0 dropped
1534981838.356568 Reporter::INFO 830080 packets received on interface ens1f0, 0 dropped
1534981838.359734 Reporter::INFO 758136 packets received on interface ens1f0, 0 dropped
1534981838.367822 Reporter::INFO 550137 packets received on interface ens1f0, 0 dropped
1534981838.361739 Reporter::INFO 665719 packets received on interface ens1f0, 0 dropped
1534981838.362719 Reporter::INFO 488230 packets received on interface ens1f0, 0 dropped
1534981838.365881 Reporter::INFO 826799 packets received on interface ens1f0, 0 dropped
1534981838.363606 Reporter::INFO 524107 packets received on interface ens1f0, 0 dropped
1534981838.366718 Reporter::INFO 1593921 packets received on interface ens1f0, 0 dropped
1534981838.368750 Reporter::INFO 631846 packets received on interface ens1f0, 0 dropped
1534981838.364730 Reporter::INFO 555121 packets received on interface ens1f0, 0 dropped
1534981838.369824 Reporter::INFO 610239 packets received on interface ens1f0, 0 dropped
1534981838.371295 Reporter::INFO 575919 packets received on interface ens1f0, 0 dropped
1534981838.373749 Reporter::INFO 1054019 packets received on interface ens1f0, 0 dropped
1534981838.372598 Reporter::INFO 557356 packets received on interface ens1f0, 0 dropped
1534981838.376138 Reporter::INFO 1123785 packets received on interface ens1f0, 0 dropped
1534981838.379690 Reporter::INFO 668758 packets received on interface ens1f0, 0 dropped
1534981838.377535 Reporter::INFO 1421582 packets received on interface ens1f0, 0 dropped
1534981838.378215 Reporter::INFO 669462 packets received on interface ens1f0, 0 dropped
1534981838.375390 Reporter::INFO 634210 packets received on interface ens1f0, 0 dropped
1534981838.380430 Reporter::INFO 537366 packets received on interface ens1f0, 0 dropped
1534981216.903010 Reporter::INFO 6762256 packets received on interface ens1f0, 0 dropped
1534981216.905617 Reporter::INFO 5045464 packets received on interface ens1f0, 0 dropped
1534981216.904381 Reporter::INFO 4320028 packets received on interface ens1f0, 0 dropped
1534981216.909996 Reporter::INFO 3877533 packets received on interface ens1f0, 0 dropped
1534981216.909067 Reporter::INFO 6006595 packets received on interface ens1f0, 0 dropped
1534981216.913121 Reporter::INFO 6123109 packets received on interface ens1f0, 0 dropped
1534981216.911284 Reporter::INFO 3797206 packets received on interface ens1f0, 0 dropped
1534981216.912225 Reporter::INFO 3539501 packets received on interface ens1f0, 0 dropped
1534981216.913347 Reporter::INFO 3215296 packets received on interface ens1f0, 0 dropped
1534981216.907955 Reporter::INFO 4181023 packets received on interface ens1f0, 0 dropped
1534981216.915150 Reporter::INFO 4228532 packets received on interface ens1f0, 0 dropped
1534981216.917084 Reporter::INFO 3895338 packets received on interface ens1f0, 0 dropped
1534981216.919152 Reporter::INFO 4093951 packets received on interface ens1f0, 0 dropped
1534981216.916182 Reporter::INFO 4888118 packets received on interface ens1f0, 0 dropped
1534981216.930824 Reporter::INFO 4969258 packets received on interface ens1f0, 0 dropped
1534981216.923402 Reporter::INFO 3541359 packets received on interface ens1f0, 0 dropped
1534981216.921326 Reporter::INFO 5378841 packets received on interface ens1f0, 0 dropped
1534981216.918085 Reporter::INFO 3692685 packets received on interface ens1f0, 0 dropped
1534981216.927073 Reporter::INFO 4201218 packets received on interface ens1f0, 0 dropped
1534981216.929713 Reporter::INFO 2972462 packets received on interface ens1f0, 0 dropped
1534981216.920093 Reporter::INFO 3657035 packets received on interface ens1f0, 0 dropped
1534981216.934715 Reporter::INFO 3807340 packets received on interface ens1f0, 0 dropped
1534981216.924635 Reporter::INFO 4475718 packets received on interface ens1f0, 0 dropped
1534981216.936953 Reporter::INFO 4226114 packets received on interface ens1f0, 0 dropped
1534981216.928142 Reporter::INFO 5128429 packets received on interface ens1f0, 0 dropped
1534981216.925807 Reporter::INFO 5599345 packets received on interface ens1f0, 0 dropped
1534981216.906854 Reporter::INFO 3459585 packets received on interface ens1f0, 0 dropped
1534981216.935880 Reporter::INFO 4798353 packets received on interface ens1f0, 0 dropped
1534981216.933520 Reporter::INFO 3658076 packets received on interface ens1f0, 0 dropped
1534981216.931230 Reporter::INFO 3996410 packets received on interface ens1f0, 0 dropped
1534970997.269085 Reporter::INFO 75579541 packets received on interface ens1f0, 0 dropped
1534970997.272310 Reporter::INFO 77450480 packets received on interface ens1f0, 0 dropped
1534970997.275471 Reporter::INFO 73248332 packets received on interface ens1f0, 0 dropped
1534970997.270934 Reporter::INFO 91638422 packets received on interface ens1f0, 0 dropped
1534970997.273268 Reporter::INFO 98209619 packets received on interface ens1f0, 0 dropped
1534970997.277737 Reporter::INFO 102107215 packets received on interface ens1f0, 0 dropped
1534970997.274526 Reporter::INFO 80458937 packets received on interface ens1f0, 0 dropped
1534970997.278958 Reporter::INFO 76230406 packets received on interface ens1f0, 0 dropped
1534970997.276604 Reporter::INFO 80722958 packets received on interface ens1f0, 0 dropped
1534970997.279882 Reporter::INFO 100099311 packets received on interface ens1f0, 0 dropped
1534970997.283074 Reporter::INFO 69951029 packets received on interface ens1f0, 0 dropped
1534970997.281120 Reporter::INFO 75612600 packets received on interface ens1f0, 0 dropped
1534970997.282298 Reporter::INFO 77578375 packets received on interface ens1f0, 0 dropped
1534970997.284196 Reporter::INFO 85732570 packets received on interface ens1f0, 0 dropped
1534970997.289385 Reporter::INFO 76071057 packets received on interface ens1f0, 0 dropped
1534970997.287212 Reporter::INFO 82814054 packets received on interface ens1f0, 0 dropped
1534970997.285008 Reporter::INFO 86418433 packets received on interface ens1f0, 0 dropped
1534970997.286085 Reporter::INFO 81872671 packets received on interface ens1f0, 0 dropped
1534970997.288381 Reporter::INFO 80730232 packets received on interface ens1f0, 0 dropped
1534970997.290577 Reporter::INFO 82468402 packets received on interface ens1f0, 0 dropped
1534970997.294493 Reporter::INFO 83079761 packets received on interface ens1f0, 0 dropped
1534970997.291726 Reporter::INFO 75226715 packets received on interface ens1f0, 0 dropped
1534970997.301214 Reporter::INFO 73921183 packets received on interface ens1f0, 0 dropped
1534970997.295776 Reporter::INFO 76411550 packets received on interface ens1f0, 0 dropped
1534970997.293085 Reporter::INFO 78806075 packets received on interface ens1f0, 0 dropped
1534970997.299137 Reporter::INFO 92333460 packets received on interface ens1f0, 0 dropped
1534970997.296879 Reporter::INFO 76489847 packets received on interface ens1f0, 0 dropped
1534970997.297955 Reporter::INFO 77040177 packets received on interface ens1f0, 0 dropped
1534970997.300228 Reporter::INFO 73375388 packets received on interface ens1f0, 0 dropped
1534970997.270025 Reporter::INFO 85569417 packets received on interface ens1f0, 0 dropped
1534977985.775660 Reporter::INFO 23439044 packets received on interface ens1f0, 0 dropped
1534977985.774108 Reporter::INFO 17765795 packets received on interface ens1f0, 0 dropped
1534977985.777926 Reporter::INFO 16760996 packets received on interface ens1f0, 0 dropped
1534977985.776700 Reporter::INFO 17259361 packets received on interface ens1f0, 0 dropped
1534977985.780145 Reporter::INFO 18713880 packets received on interface ens1f0, 0 dropped
1534977985.779011 Reporter::INFO 16901868 packets received on interface ens1f0, 0 dropped
1534977985.781391 Reporter::INFO 18714893 packets received on interface ens1f0, 0 dropped
1534977985.783404 Reporter::INFO 16282338 packets received on interface ens1f0, 0 dropped
1534977985.782445 Reporter::INFO 16293375 packets received on interface ens1f0, 0 dropped
1534977985.784534 Reporter::INFO 29243070 packets received on interface ens1f0, 0 dropped
1534977985.786378 Reporter::INFO 23601045 packets received on interface ens1f0, 0 dropped
1534977985.790525 Reporter::INFO 20118319 packets received on interface ens1f0, 0 dropped
1534977985.785406 Reporter::INFO 16852112 packets received on interface ens1f0, 0 dropped
1534977985.791658 Reporter::INFO 16997069 packets received on interface ens1f0, 0 dropped
1534977985.789483 Reporter::INFO 16957150 packets received on interface ens1f0, 0 dropped
1534977985.787405 Reporter::INFO 15663429 packets received on interface ens1f0, 0 dropped
1534977985.793724 Reporter::INFO 15732747 packets received on interface ens1f0, 0 dropped
1534977985.792848 Reporter::INFO 20594428 packets received on interface ens1f0, 0 dropped
1534977985.795803 Reporter::INFO 17450153 packets received on interface ens1f0, 0 dropped
1534977985.794824 Reporter::INFO 19299170 packets received on interface ens1f0, 0 dropped
1534977985.799558 Reporter::INFO 21344899 packets received on interface ens1f0, 0 dropped
1534977985.797155 Reporter::INFO 15890636 packets received on interface ens1f0, 0 dropped
1534977985.803070 Reporter::INFO 16998576 packets received on interface ens1f0, 0 dropped
1534977985.798511 Reporter::INFO 15619707 packets received on interface ens1f0, 0 dropped
1534977985.799711 Reporter::INFO 17108695 packets received on interface ens1f0, 0 dropped
1534977985.805260 Reporter::INFO 16192817 packets received on interface ens1f0, 0 dropped
1534977985.804186 Reporter::INFO 18282565 packets received on interface ens1f0, 0 dropped
1534977985.802072 Reporter::INFO 16240542 packets received on interface ens1f0, 0 dropped
1534977985.806260 Reporter::INFO 22287402 packets received on interface ens1f0, 0 dropped
1534977985.788479 Reporter::INFO 15441331 packets received on interface ens1f0, 0 dropped
1534978504.904141 Reporter::INFO 1151439 packets received on interface ens1f0, 0 dropped
1534978504.906757 Reporter::INFO 672616 packets received on interface ens1f0, 0 dropped
1534978504.910902 Reporter::INFO 798199 packets received on interface ens1f0, 0 dropped
1534978504.907730 Reporter::INFO 820256 packets received on interface ens1f0, 0 dropped
1534978504.905443 Reporter::INFO 1001287 packets received on interface ens1f0, 0 dropped
1534978504.908955 Reporter::INFO 1377270 packets received on interface ens1f0, 0 dropped
1534978504.911752 Reporter::INFO 1220954 packets received on interface ens1f0, 0 dropped
1534978504.912143 Reporter::INFO 1084331 packets received on interface ens1f0, 0 dropped
1534978504.914831 Reporter::INFO 1029296 packets received on interface ens1f0, 0 dropped
1534978504.916659 Reporter::INFO 912276 packets received on interface ens1f0, 0 dropped
1534978504.915751 Reporter::INFO 772154 packets received on interface ens1f0, 0 dropped
1534978504.918896 Reporter::INFO 696069 packets received on interface ens1f0, 0 dropped
1534978504.909908 Reporter::INFO 1059539 packets received on interface ens1f0, 0 dropped
1534978504.917650 Reporter::INFO 733904 packets received on interface ens1f0, 0 dropped
1534978504.913752 Reporter::INFO 1163756 packets received on interface ens1f0, 0 dropped
1534978504.921136 Reporter::INFO 788919 packets received on interface ens1f0, 0 dropped
1534978504.919927 Reporter::INFO 710985 packets received on interface ens1f0, 0 dropped
1534978504.923178 Reporter::INFO 819349 packets received on interface ens1f0, 0 dropped
1534978504.925485 Reporter::INFO 1297935 packets received on interface ens1f0, 0 dropped
1534978504.924478 Reporter::INFO 918820 packets received on interface ens1f0, 0 dropped
1534978504.922142 Reporter::INFO 735662 packets received on interface ens1f0, 0 dropped
1534978504.932603 Reporter::INFO 1145281 packets received on interface ens1f0, 0 dropped
1534978504.928078 Reporter::INFO 714078 packets received on interface ens1f0, 0 dropped
1534978504.902959 Reporter::INFO 696518 packets received on interface ens1f0, 0 dropped
1534978504.929356 Reporter::INFO 1289706 packets received on interface ens1f0, 0 dropped
1534978504.931471 Reporter::INFO 716412 packets received on interface ens1f0, 0 dropped
1534978504.930397 Reporter::INFO 803952 packets received on interface ens1f0, 0 dropped
1534978504.934880 Reporter::INFO 1615952 packets received on interface ens1f0, 0 dropped
1534978504.927006 Reporter::INFO 663539 packets received on interface ens1f0, 0 dropped
1534978504.933828 Reporter::INFO 857559 packets received on interface ens1f0, 0 dropped
1534891115.907765 Reporter::INFO 5219842 packets received on interface ens1f0, 0 dropped
1534891115.910010 Reporter::INFO 8455238 packets received on interface ens1f0, 0 dropped
1534891115.912626 Reporter::INFO 5228397 packets received on interface ens1f0, 0 dropped
1534891115.908858 Reporter::INFO 5747798 packets received on interface ens1f0, 0 dropped
1534891115.911476 Reporter::INFO 5697178 packets received on interface ens1f0, 0 dropped
1534891115.915011 Reporter::INFO 4743597 packets received on interface ens1f0, 0 dropped
1534891115.918402 Reporter::INFO 6230492 packets received on interface ens1f0, 0 dropped
1534891115.913823 Reporter::INFO 7316772 packets received on interface ens1f0, 0 dropped
1534891115.920544 Reporter::INFO 8210898 packets received on interface ens1f0, 0 dropped


