Bug 11455 – DHCP takes a long time to reconnect (was Firmware update does not recover if WiFi is switched off)

Status:	CLOSED FIXED

Product:	SqueezePlay
Classification:	Unclassified
Component:	Networking
Version:	unspecified
Platform:	PC Other

Importance:	P1 normal
Target Milestone:	7.5.0
Assigned To:	Felix Mueller

URL:
Keywords:	alarm_related

Depends on:
Blocks:

Reported:	2009-03-24 13:13 UTC by Dan Evans
Modified:	2010-04-08 17:25 UTC
CC List:	2 users

See Also:
Category:	---

Description Dan Evans 2009-03-24 13:13:58 UTC

(using r4875)

After successfully connecting Fab4 to my WiFi, I started the firmware update, but... midway through the update I switched off the WiFi in my router.  Approx. 60 seconds later I got an error message about this.  That's fine.

But after switching the WiFi back on, pressing "Try Again..." continues to fail.  

I would think after the WiFi was returned that Fab4 would reconnect.  Is this not true?

...Update...

Several minutes later I tried again and it succeeded.  I had left the Fab4 in that state, waiting to continue.  It must have reconnected during that time.  But it really was several minutes.

Slow to reconnect under those conditions?

Comment 1 Richard Titmuss 2009-04-01 02:23:47 UTC

I have tested this here, and within 30 seconds after turning the router back on the firmware update can be restarted. It probably takes at least 15-20 seconds for the router to boot and fab4 to reconnect, longer probably if the router also supports dsl.

Comment 2 Dan Evans 2009-04-06 14:55:25 UTC

(using r5156)
I am still seeing this as broken.

I switched OFF the WiFi and the fw update failed.  No problem.  I switched ON the WiFi and waited until I knew it was live and working again. (other devices successfully connected.)  

Then I pressed "Try again" and it still failed with, "There was a problem installing this update. Please try again..."  Pressing "Try again" just jumps immediately back to this error.  I tried this for about a minute.

...Several minutes later...

Again, I left it sitting for a few minutes and tried again-- it reconnected and re-downloaded the fw update.  

I guess I just don't understand why Fab4 is taking several minutes to renegotiate the connection.

---Update---

I tested this on a different Fab4, to be sure this wasn't a hardware thing, and I saw the same behavior.  This time I timed it.  First, I began the firmware update, and then:

02:42:00 - Turned OFF the WiFi
02:43:00 - Fab4 correctly detected a problem  (time-out seems long though: 60s?)
02:43:30 - Turned ON the WiFi
02:44:00 - Other SBs successfully connected to WiFi
02:44:30 - I began pressing "Try Again" on Fab4 
02:46:00 - Firmware update restarted on Fab4 and went smoothly from there.

Granted it does recover, but after 2 full minutes.  If I were a customer, I'd have given up much earlier.

Comment 3 Richard Titmuss 2009-04-06 15:02:19 UTC

Dan, I need logs to understand this. Have you tried with a different router?

Comment 4 Richard Titmuss 2009-04-08 02:43:53 UTC

I think this bug is caused by the increasing interval between DHCP requests. So if fab4 is not connected (AP turned off, ethernet disconnected) for several minutes, it can take a while for DHCP to resolve itself.

For wireless, I wonder if the wpa_cli action script is still needed to send a SIGUSR1 to udhcpc to restart things quicker. For wired, udhcpc probably needs to monitor the eth0 link status and restart when the link is attached.

Felix what do you think? Any easier solutions? We should probably at least fix wireless for MP.
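
The wpa_cli action-script idea above could look roughly like the following. This is only a sketch, not what Fab4 actually ships: the pidfile path is an assumption, and the function names are made up for illustration. It relies on two things that are standard: wpa_cli -a passes the interface name and an event (CONNECTED or DISCONNECTED) to the action script, and udhcpc renews its lease when it receives SIGUSR1.

```python
import os
import signal


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if it is missing or invalid."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def on_wpa_event(event, pidfile="/var/run/udhcpc.pid", kill=os.kill):
    """Handle one wpa_cli action event; return True if udhcpc was signalled.

    The default pidfile path is an assumption for illustration only.
    """
    if event != "CONNECTED":
        return False  # nothing to do on DISCONNECTED
    pid = read_pid(pidfile)
    if pid is None:
        return False
    try:
        kill(pid, signal.SIGUSR1)  # udhcpc renews its lease on SIGUSR1
    except ProcessLookupError:
        return False  # stale pidfile, udhcpc is not running
    return True
```

An action script would call on_wpa_event(sys.argv[2]), since wpa_cli passes the interface as the first argument and the event as the second.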

Comment 5 Felix Mueller 2009-04-08 11:00:21 UTC

BusyBox's udhcpc defaults to a fixed 20s retry timeout, i.e. the interval is not increasing.

This means it can take a maximum of 20s after the network is available until Fab4 will have an ip address again.

The retry timeout can be modified with the '-A' parameter.

Are you suggesting we decrease the retry timeout?
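
For reference, a minimal sketch of how the '-A' parameter could be passed when launching udhcpc. The helper name and the 10 second value are hypothetical; only the -i (interface) and -A (seconds to wait after a failed attempt) options are taken from BusyBox udhcpc itself.

```python
def udhcpc_command(interface, retry_seconds=20):
    """Build a BusyBox udhcpc command line with an explicit retry timeout.

    Hypothetical helper for illustration; 20 matches the default quoted above.
    """
    if retry_seconds <= 0:
        raise ValueError("retry_seconds must be positive")
    # -i selects the interface; -A sets the wait after a failed attempt.
    return ["udhcpc", "-i", interface, "-A", str(retry_seconds)]
```

The resulting list can be handed to subprocess.Popen, e.g. subprocess.Popen(udhcpc_command("eth0", 10)).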

Comment 6 Felix Mueller 2009-04-08 15:17:13 UTC

I am not sure reducing the retry timeout makes a lot of sense here. Something else must be causing the roughly 2 minutes it takes to recover.

If I reduce the retry timeout from the standard 30s to, let's say, 10s, that only brings the above scenario down to about 1m 40s, which isn't really better, is it?

Comment 7 Felix Mueller 2009-04-09 07:13:57 UTC

So far I found this: If I power down the router during the fw update, the first failed DNS lookup takes 20s; all the following ones fail instantaneously. When the network is back, it takes about 2 minutes until DNS is successful again.

Richard's comment taken from campfire:

ah, yes. the dns timeout in libc is long.
ok, so i don't think there is an easy solution to that then.

Comment 8 Wadzinski Tom 2009-08-06 11:15:49 UTC

Another thing I saw related to this:
Using SN, I took it out of wireless range. After returning to wireless range it couldn't reconnect to SN for a few minutes, logging: SlimProto.lua:532 dns lookup failed for fab4.squeezenetwork.com
I tested again watching the Diagnostics screen. Here is the order of what I see: a) first wireless comes back, then b) DHCP completes but SN DNS fails, then c) SC ping succeeds, then after a few minutes d) SN DNS succeeds

The thinking is that when a DNS lookup occurs with no connection, the lookup doesn't time out for 120 seconds.

Suggestions from campfire regarding the DNS timeout.
1) Only do a toip when the network is up
2) Hold onto a "backup cache", such that the last good DNS value is saved and, if there is a DNS failure, the "last good" value is used.
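
The two suggestions above could be combined along these lines. This is a rough sketch only, not code from SqueezePlay: the class name, the network_up flag and the injectable lookup function are all assumptions made for illustration.

```python
import socket


class LastGoodResolver:
    """Resolve hostnames, falling back to the last address that worked."""

    def __init__(self, lookup=socket.gethostbyname):
        self._lookup = lookup   # injectable so the sketch can be tested
        self._last_good = {}    # hostname -> last successfully resolved IP

    def resolve(self, host, network_up=True):
        """Return an IP for host, or None if it was never resolved."""
        if network_up:  # suggestion 1: only look up when the network is up
            try:
                ip = self._lookup(host)
            except OSError:  # socket.gaierror is a subclass of OSError
                pass
            else:
                self._last_good[host] = ip
                return ip
        # suggestion 2: fall back to the "last good" value, if any
        return self._last_good.get(host)
```

One trade-off is that a cached address can be stale if the server has moved, so a caller would probably still want to retry the real lookup once the network is back.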

Comment 9 Ben Klaas 2009-08-26 07:49:55 UTC

this is an administrative shuffle on priority fields to help make better judgment on the top end of the priority list. P4->P5, P3->P4, and P2->P3.

Comment 10 Pat Ransil 2009-10-23 05:09:33 UTC

Administrative move of 7.5 bugs. All P2, P3, P4 being downgraded one level. Will then split P1s.

Comment 11 Chris Owens 2010-01-04 09:27:10 UTC

Bug meeting team has a consensus that this bug may be related to a number of issues.

Comment 12 SVN Bot 2010-01-12 11:54:15 UTC

 == Auto-comment from SVN commit #8319 to the jive repo by felix ==
 == https://svn.slimdevices.com/jive?view=revision&revision=8319 ==

Bug: 11455 
Description:  
- Reduced DNS timeout to 10 seconds (was 2 minutes). My tests on Jive, Baby and Touch showed that a 10 seconds timeout is still enough to allow the message pipe to empty even if a lot of DNS resolve requests are queued up while the network is down.
- Fixed slow memory leak in DNS resolver thread (occurred while the network was down)
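
To illustrate the idea of a bounded DNS timeout (this is not the code from the commit, just a sketch under assumptions): the lookup runs in a worker thread and the caller gives up after a fixed number of seconds. The function name, the pool size and the injectable lookup parameter are hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool; the size is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(host, timeout=10.0, lookup=socket.gethostbyname):
    """Return the IP for host, or None on failure or after timeout seconds."""
    future = _pool.submit(lookup, host)
    try:
        return future.result(timeout=timeout)
    except (FutureTimeout, OSError):
        # Timed out, or the resolver reported an error (gaierror is an OSError).
        return None
```

Note that the worker thread keeps running after the caller has given up, so requests can still pile up while the network is down; the timeout only bounds how long the caller waits.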

Comment 13 Felix Mueller 2010-01-12 11:56:12 UTC

Update hours worked.

Comment 14 Felix Mueller 2010-01-13 06:30:36 UTC

Change is in 7.4 r8319 and 7.5 r8321.

Comment 15 Chris Owens 2010-04-08 17:25:47 UTC

This bug has been marked fixed in a released version of Squeezebox Server or the accompanying firmware or mysqueezebox.com release.

If you are still seeing this issue, please let us know!