Bug 11984 - SN Database / Replication / DNS issues
Status: CLOSED FIXED
Product: MySqueezebox.com
Classification: Unclassified
Component: Database
Version: Prod
Hardware: PC Windows XP
Importance: -- normal
Target Milestone: Hotfix
Assigned To: Brandon Black
http://forums.slimdevices.com/forumdi...
: Support-Important
Depends on:
Blocks: 11986 12163
 
Reported: 2009-05-06 16:10 UTC by Anoop Mehta
Modified: 2009-10-05 16:44 UTC (History)

See Also:
Category: ---


Description Anoop Mehta 2009-05-06 16:10:33 UTC
Support has been dealing with customers having Database issues via SN. 

Here is a prime example of what I had to do with a customer today. 

Issue: Customer could not log into SN. 

Steps taken to fix the issue:
1. Logged into all three databases via the SN admin tool and searched for the customer's account by his email address. 
2. Changed the customer's password on all 3 databases. 
3. Asked the customer to check whether he was able to log in. The customer was then fully able to log in. 

Before logging into all 3 databases, I tried to simply use the admin page to change the customer's password, and that did not work.  

LaRon, please add your input on the scenarios that you have run into regarding database issues.
Comment 1 Anoop Mehta 2009-05-06 16:14:44 UTC
Julius, please add your findings to this bug as well.
Comment 2 Brandon Black 2009-05-06 16:18:32 UTC
There have been additional lag issues lately.  More work to address lag is coming soon.  Most likely your initial attempt to reset the password was at one datacenter, and the customer was at another.  Eventually even your first change would propagate to all, but this is lag-dependent.  There are now 4 changes of this password propagating, plus another if he changed it again after gaining access (actually, if he did, there's a good chance one of your lagged-out admin changes will overwrite his change later in the day).
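A minimal sketch of the overwrite risk described above. It assumes, purely for illustration, that a datacenter applies replicated changes in the order they arrive, with no timestamp-based conflict resolution; the actual SN replication setup is not described in this bug. The `Write` class, tick values, and password strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Write:
    issued_at: int   # when the change was made (arbitrary ticks)
    arrives_at: int  # when replication delivers it to this datacenter
    password: str


def final_password(writes):
    """Apply writes in arrival order, as a naive replica would (assumption)."""
    current = None
    for w in sorted(writes, key=lambda w: w.arrives_at):
        current = w.password
    return current


# What the customer's home datacenter sees over the day (hypothetical values).
writes = [
    Write(issued_at=1, arrives_at=50, password="admin-reset-1"),   # lagged admin change
    Write(issued_at=2, arrives_at=2, password="admin-reset-2"),    # made on this datacenter
    Write(issued_at=10, arrives_at=10, password="customer-choice"),
]

# The customer's later change is replaced by the older, lagged admin change.
print(final_password(writes))
```

Under that assumption, the oldest change wins simply because it arrives last, which is the scenario where a lagged-out admin reset overwrites the customer's own password change.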
Comment 3 Julius Dauz 2009-05-06 16:19:52 UTC
The strangest issue I've experienced today was a customer's Controller that kept disappearing from his account after he added it. Looking up the MAC address in the admin page showed that the Controller was connected to the customer's account, but after clicking through into that account, there were no players or Controllers linked to it. 

Andy also saw this issue first hand.

Other than that, I've not experienced too much else today, but I have seen the issue that Anoop described in the past, plus other issues such as invalid PINs, etc.
Comment 4 Walker LaRon 2009-05-06 17:05:17 UTC
Hello All,

I have seen this issue many times within the last 4 or 5 months.  I have spoken with Dan and Andy about this replication issue before, and the "re-assign/delete" player tool was made available to us in the SN Admin tool.  This feature now also allows you to delete a player even if it is connected to SqueezeNetwork; however, you must do this on all 4 servers, since the original issue is caused by replication.

Per Andy, I was advised that if this issue does occur, to try the following:

-Log on to all 4 servers, and go to the SN Admin Tool
-Delete the SB Receiver and Controller from ALL servers via their MAC addresses
-Have the customer factory reset the SB Duet (Controller and Receiver)
-Walk through setup, and have the customer connect to SN as the music source
-Have the customer try to enter the new PIN
-If it gives the invalid PIN error, search all 4 servers for the SB Controller's MAC address
-Once it is found on a server (usually the server closest to the customer's location), it will be linked to the unlinked account
-Assign it to the customer's account

This has helped most of the time in resolving this issue.
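The "search all 4 servers for the MAC address" step can be sketched roughly as follows. This is only an illustration: each server is modeled as an in-memory dict of MAC address to account, and the server names, MAC address, account names, and both helper functions are hypothetical. The real SN Admin Tool is a web interface and is not being described here.

```python
UNLINKED = "unlinked"  # stand-in for the "unlinked account" mentioned above


def find_player(servers, mac):
    """Return {server_name: account} for every server that knows this MAC."""
    mac = mac.lower()
    return {name: players[mac] for name, players in servers.items() if mac in players}


def servers_needing_reassignment(servers, mac, customer_account):
    """List servers where the player is linked to anything but the customer."""
    found = find_player(servers, mac)
    return sorted(name for name, account in found.items() if account != customer_account)


# Hypothetical state of four out-of-sync servers.
servers = {
    "server-1": {},
    "server-2": {"00:04:20:aa:bb:cc": UNLINKED},
    "server-3": {},
    "server-4": {"00:04:20:aa:bb:cc": "customer@example.com"},
}

print(find_player(servers, "00:04:20:AA:BB:CC"))
print(servers_needing_reassignment(servers, "00:04:20:AA:BB:CC", "customer@example.com"))
```

Checking every server, rather than only the one the admin happens to be logged into, matters because each server may hold a different view of the same player while replication is behind.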

Please note that this issue has also prevented the Squeezebox Duet from connecting to SqueezeCenter if the SqueezeNetwork credentials are entered inside of SqueezeCenter.


Thanks,

LaRon
Comment 5 Walker LaRon 2009-05-06 17:45:00 UTC
In addition to my above comments, below are scenarios in which I have had to use this workaround.

Scenario 1

Initial set up of the Squeezebox Duet, and the customer does not want to use SqueezeCenter

When trying to connect to SqueezeNetwork as the music source, the Squeezebox Receiver updates its firmware, and shows as connected to the customer's SqueezeNetwork account (by using the Admin Tool just for www.squeezenetwork.com), but the Controller never shows up.  When we try to enter the PIN, it says that it is invalid. I would then log on to the other servers to see if the MAC address could be located on another server. It usually shows up on another server, linked to the unlinked account.  Even when we link the player to the correct SN account, it sometimes doesn't show up on the normal www.squeezenetwork.com for hours.


Scenario 2

The customer has connected to SqueezeNetwork before, but now the Squeezebox Receiver shows up as connected, while the Squeezebox Controller states it cannot connect to SqueezeNetwork.  I then have the customer unplug their SB Receivers and turn off the Controller.
I would then search all of the other servers via the customer's email address used for SN, to see whether any of them still shows the SB Receiver and Controller as connected (even when the battery is removed from the Controller and the Receiver is unplugged).  When I finally find the server(s) that still show the Squeezebox Receiver and Squeezebox Controller as connected, I try to disconnect both devices from that server or those servers.  Sometimes the SB devices will NOT disconnect.  In that case, the quickest way to get them connected is to delete the players from all servers, and have the customer factory reset them, connect to SN, get a new PIN, and hope that resolves the issue.  If not, I follow the steps I have outlined in comment #4, which helps most of the time.


Scenario 3

Cannot connect to SqueezeCenter or SqueezeNetwork, but can connect to the home network.  This is after the Squeezebox Duet has been set up and working prior, and the customer is adamant that nothing has changed.

In this case, I would have the customer restart their computer and power cycle the router.  If the customer still cannot connect to SqueezeCenter or SqueezeNetwork after that, I have the customer open SqueezeCenter and remove their SqueezeNetwork credentials.  

If the customer can then connect to SqueezeCenter, I lean towards SqueezeNetwork as being the issue.  I then search all of our servers for the customer's SN email address, and usually I find that one (or more) of our servers is still displaying that their SB Controller and Receiver are connected.

I then try to disconnect the devices from each server.  Once this is done, I have the customer try connecting to SqueezeNetwork.  If they are successful, then I have them re-enter their SqueezeNetwork credentials into SqueezeCenter, and then try connecting to SqueezeCenter.  This usually works.  If this doesn't work, I then resort to the steps I have described in comment #4.


Overall, I believe that in all of the above scenarios, the issues have all pointed at our SqueezeNetwork servers not being in sync with each other.  I have experienced this issue here first hand as well.

Thanks,

LaRon
Comment 6 James Richardson 2009-05-07 08:47:11 UTC
http://forums.slimdevices.com/showthread.php?t=62887

Possibly related forum post
Comment 7 James Richardson 2009-05-08 06:48:49 UTC
I ran into problems again last night, my players were not responding to Web UI changes, because the players were attached to SV and my Web Browser was assigned to DC
Comment 8 Blackketter Dean 2009-05-08 07:15:00 UTC
Andy/Brandon: Any idea what's up?
Comment 9 Andy Grundman 2009-05-08 07:27:53 UTC
It's just the usual database lag.  Currently the lag is basically 0.  I don't think we need a bug on this, it's not going to be possible to eliminate the risk of lag completely.
Comment 10 Brandon Black 2009-05-08 07:34:05 UTC
But in regards to this, I'd really like more data:

> I ran into problems again last night, my players were not responding to Web UI
> changes, because the players were attached to SV and my Web Browser was
> assigned to DC

Under reasonably normal conditions, such as last night, your browser and players should not be connecting to different datacenters from the same point on the network (your home DSL/Cable).  Something's not right there, and it's DNS-related.  Either the player and the browser are getting different DNS answers at the same time, or one of them is caching an answer from when your primary datacenter was offline way longer than it should.  Aside from basic caching issues at DNS resolvers, the other thing that I suspect in these scenarios is the use of OpenDNS, either explicitly by the user, or implicitly by our player firmware after a single transient failure of a local DNS lookup.
Comment 11 James Richardson 2009-05-08 07:39:21 UTC
(In reply to comment #10)
> But in regards to this, I'd really like more data:
> 
> > I ran into problems again last night, my players were not responding to Web UI
> > changes, because the players were attached to SV and my Web Browser was
> > assigned to DC
> 
> Under reasonably normal conditions, such as last night, your browser and
> players should not be connecting to different datacenters from the same point
> on the network (your home DSL/Cable).  Something's not right there, and it's
> DNS-related.  

I agree with you on this one

> Either the player and the browser are getting different DNS
> answers at the same time, or one of them is caching an answer from when your
> primary datacenter was offline way longer than it should.  Aside from basic
> caching issues at DNS resolvers, the other thing that I suspect in these
> scenarios is the use of OpenDNS, either explicitly by the user, or implicitly
> by our player firmware after a single transient failure of a local DNS lookup.

I can tell you that I have never "explicitly" enabled or used OpenDNS.  Nor have 99% of our customers reporting this issue to TechSupport.  So the issue must lie somewhere in the player firmware using OpenDNS.
Comment 12 Julius Dauz 2009-05-08 11:23:11 UTC
I just experienced the same issue I experienced when this bug was filed.

I assisted a customer in setting up his Duet.

We got to the Select a Music Source part and selected SN.

I entered the PIN into the account and it looked like the player was added, then it disappeared. 

It continued to do so with every subsequent retry.
Comment 13 Spies Steven 2009-05-08 11:30:57 UTC
This sounds to me as well like the user's ip3k-based players are falling back to using OpenDNS while the SqueezeOS-based players and whatever the user's web browser is running on are not.  Of course, if the user has more than one ip3k player, only one of them might be falling back to using OpenDNS.

So I have a suggestion and a question.

My suggestion for any user experiencing issues that might be related to ip3k players falling back to OpenDNS would be to switch the user's entire network over to using OpenDNS.  The OpenDNS web site has very good information on how to do this.  If the users who switch their entire network over to OpenDNS no longer see these issues, then I believe we know that the problem is indeed the ip3k players falling back to OpenDNS.

I personally have seen the ability to sync players on SqueezeNetwork come and go at home, and OpenDNS on ip3k might be the issue for me as well.  Could that be the case?  I will try this workaround at home myself and report back.

My question is: if this does turn out to be ip3k players switching to OpenDNS too easily or for other reasons, what would be the best solution?  Would disabling the OpenDNS feature cause more harm than good?
Comment 14 Brandon Black 2009-05-11 09:21:36 UTC
I don't think switching them to OpenDNS is going to help, I think it would hurt more really.  We initially thought that OpenDNS would give fairly consistent answers, but I no longer think this is the case.  OpenDNS operates multiple DNS caches spread around geographically, all answering the same anycast IP addresses.  Assuming a healthy global internet, the user's query will be responded to by the "closest" OpenDNS cache on the network, which will in turn resolve the closest SN datacenter for them.  However, due to the nature of anycast, small transient latency or packet loss issues on the path between the user and the closest OpenDNS cache could cause the first response to instead come from a distant OpenDNS cache, resolving to a different datacenter.  The net result is that over the long term average, OpenDNS will usually return the "correct" datacenter for SN for a given user, but there is a small but non-zero chance that it will randomly return the wrong one on any given query at any time.

So if the player and the PC with the browser (and the duet, etc) are all using OpenDNS directly on every connection, it's a recipe for the worst case.  If they were all using a local DNS cache (such as the one in many end-users' routers), and that cache were in turn using OpenDNS, the situation would be slightly better, as at least whatever result OpenDNS gave at a single instant in time would be cached for all devices on the network for a 5 minute window, reducing (but not eliminating) the odds of the problem.  This would be the situation with more savvy customers who choose to use OpenDNS at home.
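A rough back-of-the-envelope calculation for the "worst case" above. It assumes two datacenters and that each direct OpenDNS query independently returns the "wrong" one with some small probability `p`; the value of `p` used here is made up for illustration, since the only thing stated above is that the chance is small but non-zero.

```python
def disagreement_probability(p, devices):
    """Chance that independently resolving devices do not all land on the same
    datacenter, given two datacenters and per-query error probability p."""
    all_correct = (1 - p) ** devices
    all_wrong = p ** devices
    return 1 - all_correct - all_wrong


p = 0.02  # hypothetical per-query chance of getting the distant datacenter

# Player, Controller and browser PC each querying OpenDNS directly.
print(disagreement_probability(p, devices=3))

# A single device can never disagree with itself.
print(disagreement_probability(p, devices=1))
```

With a shared local cache, the devices on the network reuse one answer for the duration of the cache window, so they agree with each other within that window even when the cached answer is the distant datacenter. Disagreement can still arise between connections made in different windows, which is why the cache reduces the odds rather than eliminating them.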

The best scenario would be to not have OpenDNS involved at all, including in the firmware.

The history and logic behind OpenDNS in the ip3k firmware goes something like this:

We used to try to resolve SN using the users' supplied DNS server, then failing that we would fall back to a hardcoded IP address for SN (I think we fell through a hardcoded DNS server IP too, IIRC).  This was supposed to smooth the user experience so that our product worked even when the users' networks were broken (no local DNS working).

We got rid of the hardcoded IP in the firmware when we moved datacenters, but we still wanted a fallback to work around broken user DNS, so we had the DNS server list fall back to trying OpenDNS.  I was on board with this at the time, but I've changed my mind now.  I really think in future firmwares going forward, we should eliminate OpenDNS from the firmware, and simply retry and/or fail if we can't get local DNS resolution from the user/DHCP-supplied DNS server addresses.  If those DNS servers are persistently unresponsive, which is when we intended to use OpenDNS, then either (a) they typo'd manually configuring the player, or (b) their whole network sucks and they probably can't even reliably browse the internet.  I don't think the problems OpenDNS can cause are worth it for the few cases where it might help.

To make matters worse, we're not looking for persistent failure with the current OpenDNS fallback in ip3k.  I'm pretty sure one lost request or timeout will cause an immediate fallback, so transient local issues with their lan/wlan could cause OpenDNS to come into the picture w/ ip3k to boot.
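The "retry and/or fail" policy proposed above could look roughly like this. It is a sketch of the idea only, not the ip3k firmware: `lookup` is a hypothetical injected callable standing in for a single DNS query, and the retry count, hostname, and addresses are arbitrary example values.

```python
def resolve_with_retries(hostname, dns_servers, lookup, attempts_per_server=3):
    """Try each user/DHCP-supplied DNS server a few times, then fail.

    There is deliberately no third-party fallback: one lost request only
    triggers a retry against the same local servers.
    """
    for server in dns_servers:
        for _ in range(attempts_per_server):
            try:
                return lookup(server, hostname)
            except TimeoutError:
                continue  # transient loss: retry instead of falling back
    raise LookupError(f"no answer for {hostname} from {dns_servers}")


def make_flaky_lookup(failures, answer):
    """Build a fake lookup that times out `failures` times, then answers."""
    state = {"calls": 0}

    def lookup(server, hostname):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TimeoutError("simulated lost request")
        return answer

    return lookup, state


# One dropped packet on the LAN/WLAN is absorbed by a retry.
lookup, state = make_flaky_lookup(failures=1, answer="192.0.2.10")
print(resolve_with_retries("sn.example", ["192.168.1.1"], lookup))
```

A persistently dead local resolver still ends in an error here, which matches the reasoning above that such a network is broken for everything else too.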
Comment 15 James Richardson 2009-05-12 11:17:05 UTC
http://forums.slimdevices.com/showthread.php?t=63089

Another real world example - here, the alarms are not syncing between SN and the Player
Comment 16 James Richardson 2009-06-11 13:19:07 UTC
Brandon will be testing out SQL changes to the data centers, which should hopefully address this issue.

That, and the changes made to OpenDNS in the IP3K devices, should hopefully solve this issue.
Comment 17 James Richardson 2009-06-11 13:24:54 UTC
*** Bug 11986 has been marked as a duplicate of this bug. ***
Comment 18 James Richardson 2009-06-11 13:25:12 UTC
*** Bug 12163 has been marked as a duplicate of this bug. ***
Comment 19 Brandon Black 2009-07-30 14:10:31 UTC
Between the DNS changes in the players (removing the OpenDNS fallback), the ip3k DNS bugfix, and the MySQL replication fixes on the SN side, this should be largely resolved now.  Can we close this (and open new ones for any new issues)?
Comment 20 Chris Owens 2009-07-30 17:06:58 UTC
Yes, let's
Comment 21 James Richardson 2009-10-05 16:44:03 UTC
This bug has been fixed in the latest release of MySqueezebox.com (formerly known as SqueezeNetwork)!

If you are still experiencing this problem, feel free to reopen the bug with your new comments and we'll have another look.