[fedora-arm] Wandboard Quad network dies under load?

Discussion:

Derek Atkins

2016-03-05 04:01:19 UTC

Hi,

I'm having an issue with two different wandboard quad systems; one is
running F22, the other is running F23. When the system is under high
network load, specifically high transmit load, after a while the network
just gives up. Technically it's not VERY high load, only about 2MB/s,
but it's high transmit load -- high download load seems to be fine as
far as I can tell. I know that "gives up" isn't a very technical term,
but I frankly don't know what else to call it.

* dmesg doesn't say anything about the link going down
* ifconfig shows the interface still has an IP address
* arp, however, seems to start failing (and my NFS server has an
incomplete arp address)
* ping doesn't work to anywhere (regardless of the contents of the arp table)
* DNS doesn't work (obviously -- no packets are coming or going).
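
The checks in the list above can be gathered into a single snapshot to
run from the serial console once the network stops responding. This is
only a rough sketch, not something posted in the thread: the function
names, the default interface eth0 and the placeholder target address
are assumptions, and it uses "ip" where the post used ifconfig and arp.

```shell
#!/bin/sh
# Sketch only: function names and defaults are illustrative, not from the thread.

# Count neighbour (ARP) entries that never resolved.
# Reads `ip neigh show` output on stdin.
count_unresolved_neigh() {
    awk '/(INCOMPLETE|FAILED)[ \t]*$/ { n++ } END { print n + 0 }'
}

# Print a one-shot snapshot of the symptoms listed above.
net_snapshot() {
    iface="${1:-eth0}"          # assumed interface name
    target="${2:-192.0.2.1}"    # placeholder address; substitute your gateway

    echo "carrier: $(cat "/sys/class/net/$iface/carrier" 2>/dev/null || echo unknown)"
    ip -4 addr show dev "$iface"
    echo "unresolved neighbours: $(ip neigh show dev "$iface" | count_unresolved_neigh)"
    if ping -c 3 -W 2 "$target" >/dev/null 2>&1; then
        echo "ping $target: ok"
    else
        echo "ping $target: FAILED"
    fi
    dmesg | grep -i "fec" | tail -n 5   # last few kernel messages mentioning fec
}
```

Running "net_snapshot eth0 <gateway>" before and after a hang gives two
outputs that can be compared side by side.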

I can usually recover by doing:

nmcli con down "Wired connection 1"
nmcli con up "Wired connection 1"

(the 'up' results in the message "Error: Connection activation failed.")
After that I need to pull the ethernet plug, count to 5-10, and then
plug it in again. Then I'll get the messages:

[30540.554006] fec 2188000.ethernet eth0: Link is Down
[30553.558837] fec 2188000.ethernet eth0: Link is Up - 1Gbps/Full - flow contro

(sorry for the cut messages; minicom serial console doesn't wrap lines)

After I do this the system has network again. However, it's quite
frustrating that I have to jump through all these hoops. Note that just
pulling the network cable by itself does not seem sufficient to reset
the network.

Is this a hardware problem or a software problem (or a combination of
the two)? I've had it happen on this one system three times today; I
can definitely reliably repeat it (although it does take a couple hours
until it dies). It's also happened on another system, but I've not seen
it happen since I stopped pulling data from it.

Any suggestions? I'd like to not have to go out and spend more money to
buy an Atom-based solution, even though it might be better for my use
case due to AES-NI.

Thanks,

-derek

--
Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
Member, MIT Student Information Processing Board (SIPB)
URL: http://web.mit.edu/warlord/ PP-ASEL-IA N1NWH
***@MIT.EDU PGP key available

Sean Omalley

2016-03-07 02:59:54 UTC

Derek,

I was/am having similar issues with the Atheros wireless drivers on x86_64. The DMA stuff was kicking in for some reason. Yesterday's update mostly cleared it up for me (it was once every 45 minutes; it is down to once a day), as near as I can tell. My issue sounds very similar to yours, but it has been going on for about 6 kernel updates.

I was lucky: they added a debugging patch to the DMA stop function, which actually logs the problem. There may not be an equivalent in your driver.

Otherwise, IIRC, when I was working with the Pogoplug, I had an issue with duplex settings that kept flaking out: the switch would go into half-duplex mode on a whim. Changing the cable fixed it, even though the old cable appeared to work. I think it was something like "microfractures" in the wire. It may also have ended up on a different port on the switch. That is the type of problem that usually shows up under high load. I haven't looked at the Freescale FEC chip or driver.
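
One way to check for the duplex problem described above is to read the
negotiated speed and duplex from ethtool while the load is running. The
helper below is a minimal sketch; the function name link_mode and the
interface name eth0 are assumptions, not something from the thread.

```shell
#!/bin/sh
# Sketch: extract negotiated speed and duplex from `ethtool <iface>` output on stdin.
link_mode() {
    awk -F': *' '
        /^[ \t]*Speed:/  { speed = $2 }
        /^[ \t]*Duplex:/ { duplex = $2 }
        END { print speed " " duplex }
    '
}

# Example (on the board):
#   ethtool eth0 | link_mode
```

If the duplex field flips from Full to Half during a transfer, the cable
or switch port would be worth swapping before blaming the driver.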
Sean

Derek Atkins

2016-03-07 15:10:54 UTC

Hi Sean,

Well, I'm (now) running the current F23 release .. and I'm testing it
now. I should know in an hour or two if there's an issue. Is there
something I need to turn on to see the debug log issue? I should note
this is wired ethernet, not wifi, where it stops talking to the net.

I doubt it's a cabling issue -- I have this issue on two different
boards located in different places in my house, connected via different
cables to different switches. The only common factor is Wandboard Quad
and high data transmission from the WB-Q.

Thanks,

-derek

--
Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
Member, MIT Student Information Processing Board (SIPB)
URL: http://web.mit.edu/warlord/ PP-ASEL-IA N1NWH
***@MIT.EDU PGP key available

Sean Omalley

2016-03-08 02:19:57 UTC

Hi Derek,

Post by Derek Atkins
Well, I'm (now) running the current F23 release .. and I'm testing it
now. I should know in an hour or two if there's an issue. Is there
something I need to turn on to see the debug log issue? I should note
this is wired ethernet, not wifi, where it stops talking to the net.

I saw a bug with the atheros ethernet driver as well, which had different symptoms.
The portion of the code that goes into an infinite loop for me has debugging that spews to the log files. (bool ath9k_hw_stopdmarecv(struct ath_hw *ah, bool *reset))
However, if your driver doesn't do that, it would not show up anywhere.. :)

Post by Derek Atkins
I doubt it's a cabling issue -- I have this issue on two different
boards located different places in my house connected via different
cables to different switches. The only common factor is Wandboard Quad
and high data transmission from the WB-Q.

It may not be. In fact, it all might be a red herring...

It might be a bad setting in the device tree for the revision of the Wandboard you have; the revisions apparently use two different FEC chips.

Also, you might take a poke at this too if you haven't seen it:

https://boundarydevices.com/i-mx6-ethernet/

I am leaning toward some change that broke a few things, and I am guessing it isn't ARM-specific.

Sean

Derek Atkins

2016-03-08 14:25:51 UTC

Hi Sean,

Post by Sean Omalley
Hi Derek,

I saw a bug with the atheros ethernet driver as well, which had different symptoms.
The portion of the code that goes into an infinite loop for me has
debugging that spews to the log files. (bool
ath9k_hw_stopdmarecv(struct ath_hw *ah, bool *reset))
However, if your driver doesn't do that, it would not show up anywhere.. :)

It may not be. In fact, it all might be a red herring...
It might be a bad setting in the device tree between the version of
the wandboard you have, they apparently use two different FEC chips.
https://boundarydevices.com/i-mx6-ethernet/
I am leaning toward something that changed that broke a few things and
I am guessing it isn't arm specific.

As I just mentioned in my response to Peter, upgrading to
4.4.3-300.fc23.armv7hl significantly helped. My backup lasted 18+ hours
before it died, and it only died because I ran a "du -sh"
simultaneously. Granted, that shouldn't have put it over the edge
either, but when I was running 4.2.3 backup only lasted 2-4 hours before
dying on its own.

So 4.4.3 is definitely an improvement.
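
For anyone checking whether a board is already on the kernel that
helped here, a version comparison along these lines could be used. It
is a sketch under assumptions: the function name kernel_at_least is made
up for illustration, and it relies on the version sort ("sort -V") from
GNU coreutils. Note that the thread only reports 4.4.3 as an
improvement, not as a fix.

```shell
#!/bin/sh
# Sketch: succeed if kernel release $1 is at least version $2.
# Relies on `sort -V` (GNU coreutils version sort).
kernel_at_least() {
    have="${1%%-*}"     # drop the package suffix, e.g. "-300.fc23.armv7hl"
    want="$2"
    lowest="$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)"
    [ "$lowest" = "$want" ]
}

# Example:
#   kernel_at_least "$(uname -r)" 4.4.3 || echo "kernel older than 4.4.3"
```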

-derek

--
Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
Member, MIT Student Information Processing Board (SIPB)
URL: http://web.mit.edu/warlord/ PP-ASEL-IA N1NWH
***@MIT.EDU PGP key available

Peter Robinson

2016-03-08 12:35:40 UTC

Post by Derek Atkins
I'm having an issue with two different wandboard quad systems; one is
running F22, the other is running F23. When the system is under high
network load, specifically high transmit load, after a while the network
just gives up. Technically it's not VERY high load, only about 2MB/s,
but it's high transmit load -- high download load seems to be fine as
far as I can tell. I know that "gives up" isn't a very technical term,
but I frankly don't know what else to call it.

Rev B or C?

Post by Derek Atkins
* dmesg doesn't say anything about the link going down
* ifconfig shows the interface still has an IP address
* arp, however, seems to start failing (and my NFS server has an
incomplete arp address)
* ping doesn't work to anywhere (regardless of the contents of the arp table)
* DNS doesn't work (obviously -- no packets are coming or going).
nmcli con down "Wired connection 1"
nmcli con up "Wired connection 1"
(the 'up' results in the message "Error: Connection activation failed.")
After that I need to pull the ethernet plug, count to 5-10, and then
[30540.554006] fec 2188000.ethernet eth0: Link is Down
[30553.558837] fec 2188000.ethernet eth0: Link is Up - 1Gbps/Full - flow contro
(sorry for the cut messages; minicom serial console doesn't wrap lines)
After I do this the system has network again. However it's quite
frustrating that I have to go through all these hoops. Note that just
pulling the network cable by itself does not seem sufficient to reset
the network.

What happens if you run "rmmod fec; sleep 5; modprobe fec"? Does that
have the same effect as all of the above?

Post by Derek Atkins
Is this a hardware problem or a software problem (or a combination of
the two)? I've had it happen on this one system three times today; I
can definitely reliably repeat it (although it does take a couple hours
until it dies). It's also happened on another system, but I've not seen
it happen since I stopped pulling data from it.

If it's the former it should be able to be worked around with the
latter. I've not seen it, but then I don't use my WBQ for high load. The
i.MX6 onboard NICs do have a throughput issue in that they can't do
line-speed Gbit, but rather top out around 450 Mbps (if memory serves),
but that shouldn't affect stability.

Peter

Derek Atkins

2016-03-08 14:23:10 UTC

Hi Peter,

Post by Peter Robinson

Rev B or C?

The one I'm working on right now says Rev C1. I don't know the rev of
the other one -- I'd have to go open it up to see. I can do that if you
want the answer, but it's actually my production mythtv backend (and
still running F22) so I can't really run a bunch of tests on that.

Post by Peter Robinson

Post by Derek Atkins
After I do this the system has network again. However it's quite
frustrating that I have to go through all these hoops. Note that just
pulling the network cable by itself does not seem sufficient to reset
the network.

what happens if you "rmmod fec; sleep 5; modprobe fec" does that have
the same effect as all of the above?

Mostly, yes. I had to run:

ifconfig eth0 down; rmmod fec; sleep 5; modprobe fec

(without the ifconfig down the rmmod didn't work). But that did bring
it back to life.
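
The recovery sequence above could be wrapped in a small helper so it can
be run (as root) from the serial console in one step. This is only a
sketch: the names run, reset_fec and DRY_RUN are made up for
illustration, and the default interface eth0 is taken from the log
lines earlier in the thread.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are illustrative.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reload the fec driver: interface down, unload, pause, load again.
reset_fec() {
    iface="${1:-eth0}"
    # The rmmod reportedly fails unless the interface is taken down first.
    run ifconfig "$iface" down &&
    run rmmod fec &&
    run sleep 5 &&
    run modprobe fec
}

# Example: DRY_RUN=1 prints the four commands instead of running them.
```

Setting DRY_RUN=1 before calling reset_fec shows what would be executed
without touching the interface.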

I should note that as of about 5:30am I was going to respond to Sean and
say that "the upgrade to 4.4.3-300.fc23.armv7hl fixed it." However
while it IS better (lasting about 18 hours versus 2-4), it did
eventually die around 6:30am (when I manually ran a du -sh to see how
much of the backup had completed in 18 hours). So clearly something
between 4.2.3 and 4.4.3 improved the situation, but didn't completely
correct it.

Post by Peter Robinson

Right now I think I'm CPU bound via encfs running AES so I think it's
fine that I can't hit the full 1Gbps. However I suspect the current set
of ARM solutions available might not be the right platform for my
backup server.

Although openssl speed on my Wandboard says I should be getting 20MB/s,
I appear to only be getting about 1MB/s. (My laptop seems to be able to
do about 100MB/s according to openssl -- strangely "openssl engine" does
not report "aesni" on my F23 x86_64 laptop).
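
A rough way to compare the two machines is to check whether the CPU
advertises AES instructions at all and then benchmark through OpenSSL's
EVP interface. The sketch below makes several assumptions: the function
name has_aes_flag is made up, the cipher aes-128-cbc is a guess (encfs
may be configured with a different one), and the likely reason "openssl
engine" lists no "aesni" entry is that recent OpenSSL releases use
AES-NI through EVP rather than through a separate engine.

```shell
#!/bin/sh
# Sketch: succeed if /proc/cpuinfo-style text on stdin lists an "aes" CPU flag.
# x86 kernels use a "flags" line, ARM kernels a "Features" line.
has_aes_flag() {
    awk '
        /^(flags|Features)[ \t]*:/ {
            for (i = 1; i <= NF; i++) if ($i == "aes") found = 1
        }
        END { exit !found }
    '
}

# Example:
#   has_aes_flag < /proc/cpuinfo && echo "CPU advertises AES instructions"
#   openssl speed -evp aes-128-cbc   # cipher is an assumption, see above
```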

-derek

--
Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
Member, MIT Student Information Processing Board (SIPB)
URL: http://web.mit.edu/warlord/ PP-ASEL-IA N1NWH
***@MIT.EDU PGP key available