When a subscriber-facing link carrying about 12,000 clients goes down, all of those clients try to authorize on the Juniper MX again by sending DHCP Discover messages. The link is one aggregated Ethernet bundle made up of two member links that go to two different FPCs. This is an eoip network.

When too many clients authorize at once (via RADIUS), the CPU of the FPC (MPC 3D 16x 10GE) goes to about 100% and the FPC stops processing authorizations for other clients.

Is this normal, meaning I should limit the number of clients per FPC, or do I have a misconfiguration?

juniper-1> show version 
Model: mx960
Junos: 16.1R6-S3.1
JUNOS OS Kernel 64-bit  [20180310.ba55661_builder_stable_10]
...

Fewer licenses are in use than are installed on the system:

juniper-1> show system license     
License usage: 
                          Licenses     Licenses    Licenses
  Feature name                used    installed      needed 
  ...
  scale-subscriber           39614       128000           0
  ...

All four LAGs have identical configurations, and each consists of two 10G links. Each link is connected to a different card (xe-0/2/3, xe-1/0/1):

juniper-1> show configuration interfaces ae3
flexible-vlan-tagging;
auto-configure {
    vlan-ranges {
        dynamic-profile vlan-autosense-profile {
            accept dhcp-v4;
            ranges {
                3-4094;
            }
        }
    }
}
encapsulation flexible-ethernet-services;
aggregated-ether-options {
    load-balance {
        no-adaptive;
    }
    minimum-links 1;
    link-speed 10g;
    lacp {
        active;
        periodic fast;
    }
}

The allowed VLANs are configured on the downstream switch. Each VLAN is allowed on only one of the four LAGs, e.g. VLAN100 is allowed on ae1 only.

juniper-1> show configuration dynamic-profiles dhcp-local-server-profile 
interfaces {
    demux0 {
        unit "$junos-interface-unit" {
            no-traps;
            proxy-arp;
            demux-options {
                underlying-interface "$junos-underlying-interface";
            }
            targeted-distribution;
            family inet {
                demux-source {
                    $junos-subscriber-ip-address;
                }
                inactive: filter {
                    input "$junos-input-filter";
                    output "$junos-output-filter";
                }
                unnumbered-address lo0.0;
            }
        }
    }
}

Every client has three to five activated services. Autosense profile configuration:

juniper-1> show configuration dynamic-profiles vlan-autosense-profile   
interfaces {
    "$junos-interface-ifd-name" {
        unit "$junos-interface-unit" {
            demux-source inet;
            proxy-arp;
            vlan-id "$junos-vlan-id";
            family inet {
                unnumbered-address lo0.0;
            }
        }
    }
}

juniper-1> show configuration system configuration-database    
max-db-size 314572800;

Allowed users with public IPs have four services (speed limits - local networks, global networks, state networks). A client with a private IP has an additional service: "redirect gateway to NAT server". Finally, all clients that are denied network access have a "denying" service.

Tomato
  • Can you post the output of "show version" and "show system license"? It would also help to have your AE/LAG config as well as the related VLAN and DHCP dynamic-profiles. Lastly, are both line cards "MPC 3D 16x 10GE"? This will help me determine if it's a simple or nuanced problem. – Jordan Head Feb 05 '19 at 14:35
  • @JordanHead The DHCP server runs on the same box, so all processing is local. The DHCP local server profile has targeted-distribution: set dynamic-profiles dhcp-local-server-profile interfaces demux0 unit "$junos-interface-unit" targeted-distribution. – Tomato Feb 05 '19 at 20:17
  • Thank you, this helps! Some additional follow-up, if I may. Was this working before/did anything change when this stopped working? Does the MX stop processing DHCP packets entirely (does "show dhcp server statistics" increment?) or is it that subscribers authenticate and get an IP, but the MX is not building the sessions in the forwarding table? Can you also add the configuration for the "vlan-autosense-profile" and for "show configuration system configuration-database"? I'm also still unclear on if both FPCs that participate in each LAG are the same model or not. – Jordan Head Feb 06 '19 at 13:08
  • Ah, yes, sorry, both FPCs are the same model (MPC 3D 16x 10GE) and each LAG is connected to separate FPCs. We have had this problem since we started using the router. By the way, the router also handles all external BGP connections. – Tomato Feb 06 '19 at 14:48
  • After deploying the router, we hit this problem whenever a LAG goes down completely (the first time it happened there was a single LAG with four 10G links). After the LAG comes back up, users try to authenticate and the FPC simply stops authorizing clients, but transit traffic (for users that do not need to authorize on the box) keeps working. We did not check show dhcp server summary, but the show subscribers counts do not increment. The workaround is to delete all VLANs from the subscriber link and add them back gradually (so that the VLANs added at one time contain no more than 1000 users). – Tomato Feb 06 '19 at 14:48
  • Several months ago we tried to install a third FPC (same model). After we disabled one link of one LAG, the problem appeared again (as I understand it, because of targeted distribution the system tries to move users from the disabled link to the working one). This means that if my downstream switch reboots, my FPC can hang once more. – Tomato Feb 06 '19 at 14:48
  • JUNOS has functionality built into the subscriber management software to automatically throttle subscriber connection setup rate (assuming the requests are being received), but that happens after DDoS protection. Can you check if you're tripping DDoS violations (you can use "show ddos-protection protocols violations" to see current, and "show ddos-protection protocols statistics" to see older violations)? Depending on how your downstream switch is configured, it's possible the DHCP requests are being flooded and tripping the thresholds. – Jordan Head Feb 06 '19 at 17:35
  • Also, it looks like you have your configuration database set, I assume you also have "set system services subscriber-management enable" and "set chassis network-services enhanced-ip" configured as well? Also, did you reboot the router after enabling these? – Jordan Head Feb 06 '19 at 17:36
  • Yes, of course the router was rebooted. The state is OK now, but what will it show in case of a malfunction? Thanks for the show commands. Right now show ddos-protection protocols violations reports Packet types: 213, Currently violated: 0. There are some values in the violation count column: discover - 14, request - 45, renew - 4, bad-pack.. - 7629 – Tomato Feb 07 '19 at 10:33
  • If some packet counts exceed the maximum limits, will Juniper drop the requests, or can it lead to unpredictable behavior? – Tomato Feb 07 '19 at 10:37
  • @JordanHead Hi, the night before last I experimented with moving a LAG containing 14k clients to an empty FPC. – Tomato Mar 01 '19 at 13:09
  • I should say that all clients subscribed in 5-10 minutes or less. As you noted earlier, you keep the uplinks and subscriber access links on different cards, so I think my links are connected incorrectly. Thank you very much for pointing that out. Now I plan to put uplinks and subscriber access links on different FPCs (2 FPCs for subscribers only). – Tomato Mar 01 '19 at 13:10
  • Maybe this is because a whole FPC was dedicated to subscribers only. – Tomato Mar 01 '19 at 13:11
  • Strangely, show krt queue shows zeros on all lines: ... Routing table add queue: 0 queued Interface add/delete/change queue: 0 queued ... – Tomato Mar 01 '19 at 13:13
  • At the same time, show subscribers summary was more interesting. 02:32:26 - Subscribers by Client Type Total: 21249 02:33:36 - Subscribers by Client Type Total: 26197 02:34:21 - Subscribers by Client Type Total: 30050 02:36:15 - Subscribers by Client Type Total: 32095 – Tomato Mar 01 '19 at 13:15
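
To put the show subscribers summary samples from the last comment into perspective, the totals can be turned into an approximate login rate per interval. This is only a rough sketch: the four timestamp/total pairs are copied from the comment, while the helper name login_rates is made up for illustration, and the code assumes all samples were taken on the same night with no midnight rollover.

```python
from datetime import datetime

# Samples copied from the "show subscribers summary" totals quoted in the
# comment thread (time of sample, "Subscribers by Client Type Total").
samples = [
    ("02:32:26", 21249),
    ("02:33:36", 26197),
    ("02:34:21", 30050),
    ("02:36:15", 32095),
]


def login_rates(samples):
    """Hypothetical helper: for each consecutive pair of samples, return
    (start, end, subscribers_added, subscribers_per_second).

    Assumes timestamps are in increasing order within a single day.
    """
    fmt = "%H:%M:%S"
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        seconds = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
        added = n1 - n0
        rates.append((t0, t1, added, added / seconds))
    return rates


for t0, t1, added, per_sec in login_rates(samples):
    print(f"{t0} -> {t1}: +{added} subscribers, {per_sec:.1f}/s")
```

With these samples the session count grows by roughly 70 to 85 subscribers per second in the first two intervals and then slows to about 18 per second, which would be consistent with most clients having already logged in by then.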

0 Answers