Okay, so after much pain trying to solve this issue, I have discovered my problem. This may very well explain other issues people are seeing as well. I have verified several times that this is an issue only with 2.9, as 2.8.28 has never caused this behavior and proved to work fine on this network.
My simplified setup (ascii visio?):
{ Internet } --- [ https www server ]
      |
      |
      | 172.16.2.4/30
      |
[ DSL Router (No NAT, just routed) ]
      | 10.0.0.1/29
      |
      |
      | 10.0.0.2/29
      |
[ MT ROS 2.9 (NATs, w/hotspot) ]
      | 192.168.1.1/24
      |
      |
{ clients, wired and wireless - bridged }
For this example and explanation, the following IPs need to be mentioned:
192.168.1.100 - wireless client with problems
192.168.1.1 - client's gateway, ether2/hotspot IP, which is NATed
10.0.0.2 - WAN on MT (ether1)
10.0.0.1 - LAN on DSL Router, MT's gateway
Again, the problem was that some people, including myself at one point, could not access certain parts of the https www server, even after entering it in the walled garden by domain, IP, ports and protocols, etc. We tried every combination there is, including adding these to the pre-hotspot chains (which have proven to be less useful than we were hoping for) on both the nat and forward/input chains. Still, nada.
Sniffing packets at various locations only really showed retransmits, but nothing more. So today I plopped in log rules in between most of the dynamic hotspot rules and discovered something strange. When the session stopped working, or never began and failed, I was seeing entries in the log. It started spitting out dropped packets with a src-address of 10.0.0.1 and dst-address of 192.168.1.100. The source IP is that of the DSL Router. When the session starts, the DSL Router never shows up as part of the connection (and shouldn't). But at some point, this breaks down and the packets appear to be originating from the DSL router instead of the web server. How can this be?
So, solving this was as simple as adding the DSL router IP (10.0.0.1) into the walled garden. Now, everything works fine. The DSL router is one of the BritePort 4200 routers that Covad uses, which we have never had a problem with before. We tested a 2.8.28 hotspot with an equivalent config, and didn't have this problem. We have fiddled with everything inside and out trying to figure this out, including the conn-track settings (which seem a little too tight for default settings).
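For anyone trying to picture why the extra entry matters, here is a rough toy model in Python. It is not how RouterOS implements the hotspot, just a sketch of the logic as I understand it: traffic towards an unauthenticated client only gets through if its source is on the walled garden list. The web server address (203.0.113.10) and the `is_allowed` helper are made up for illustration; the other addresses are the ones from my setup.

```python
import ipaddress

# Addresses taken from the setup described in the post.
CLIENT = ipaddress.ip_address("192.168.1.100")
DSL_ROUTER = ipaddress.ip_address("10.0.0.1")
# Hypothetical stand-in for the https www server (documentation range);
# the real address is not given in the post.
WEB_SERVER = ipaddress.ip_address("203.0.113.10")


def is_allowed(src, dst, walled_garden):
    """Toy check: a packet towards the unauthenticated client passes
    only if its source address falls inside a walled garden entry."""
    return dst == CLIENT and any(src in net for net in walled_garden)


# Walled garden with only the web server in it.
walled_garden = [ipaddress.ip_network("203.0.113.10/32")]

# Two kinds of packets seen heading to the client: normal replies from the
# server, and the unexpected ones carrying the DSL router's address.
packets = [(WEB_SERVER, CLIENT), (DSL_ROUTER, CLIENT)]

before = [is_allowed(s, d, walled_garden) for s, d in packets]

# The fix: add the DSL router's IP to the walled garden as well.
walled_garden.append(ipaddress.ip_network("10.0.0.1/32"))
after = [is_allowed(s, d, walled_garden) for s, d in packets]

print(before)  # [True, False]
print(after)   # [True, True]
```

In this model the second packet is dropped until 10.0.0.1 is listed, which matches what the log rules showed. It says nothing about why the router's address turns up as the source in the first place.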
Is this caused by a possible failure in the connection tracking? Is there some other explanation for this? I could not reproduce this at home, where I have an Actiontech DSL (NATed) router with my test gateway on the inside. If anyone could shed some light on this, it would be much appreciated. While we've got things working, I'd really like to know why a next hop router should have to be entered in a walled garden *only* for https walled garden functionality.
If anything, I hope this may help other folks with similar issues.