Q1
It is likely safest to use s() for your univariate smooths, as (at least at one time) Simon Wood suggested he might make specifying univariate smooths via the tensor product functions te(), ti(), and t2() deprecated or defunct.
The issue of different scales only comes into play when you want a smooth interaction; if you have two or more covariates whose units differ, or where you expect the degree of wiggliness in one covariate to differ from that in the other covariates, you should use a tensor product. For spatial coordinates, we might still choose a tensor product smooth over a 2D TPRS smooth (s(x, y, bs = "tp")) if the change in the x direction is expected to be more wiggly than that in the y direction.
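As a minimal sketch of the two choices (y, x, z, and dat are placeholders, not from your model):

library(mgcv)
## isotropic 2D thin plate spline: assumes x and z are on comparable scales
m_iso <- gam(y ~ s(x, z, bs = "tp"), data = dat, method = "REML")
## tensor product: invariant to the relative scaling of x and z, with a
## separate wiggliness penalty for each margin
m_te  <- gam(y ~ te(x, z), data = dat, method = "REML")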
Q2
With the default TPRS smoother, you don't need to set knots. Technically there is a knot at each unique combination of covariates involved in the smooth. Each of those knots has a basis function, which is then eigendecomposed and the eigenvalues and associated eigenvectors are sorted in terms of the magnitude of the eigenvalues. The first k eigenvectors associated with the k largest eigenvalues are then taken as the new basis of size k that is actually used to represent the smooth effect $f(x_i)$ in the model. This is known as a low-rank thin plate regression spline.
This spline is computationally demanding to set up (the eigendecomposition is the expensive part, even with algorithms that find only the first k eigenvalues and eigenvectors). As such, for tensor product smooths, where one naturally has more data because smoothing in more dimensions is more demanding of data, the default basis for the marginal smooths is a cubic regression spline (CRS). These splines do have knots, but the actual positions of those knots typically don't make much difference to the resulting spline, unless you have very unevenly spaced values of the covariate in one or more of the marginal smooths. By default, the knots for a CRS are placed at the boundaries of the data and then at evenly spaced quantiles of the covariate, which has the effect of concentrating knots where you have data.
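To make that concrete, a sketch with placeholder names; the te() call spells out the default CRS marginals explicitly:

library(mgcv)
## low-rank TPRS: no knots for you to choose, k eigen-based basis functions
b1 <- gam(y ~ s(x, k = 10), data = dat, method = "REML")
## tensor product with CRS marginals; knots are placed automatically at the
## boundaries and at quantiles of each covariate
b2 <- gam(y ~ te(x, z, bs = c("cr", "cr"), k = c(10, 10)), data = dat, method = "REML")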
The general advice with modern GAMs like those implemented in {mgcv} is to set k to be as large as you think you need to achieve the expected wiggliness, and then add a little bit. We add a little bit (increase k above what you actually think is the right wiggliness) because a basis of size k + a (for some a smaller than k) is a much richer basis for functions of wiggliness k than a basis of only size k. Basically, we are trying to ensure that the basis expansion we create to represent a smooth function in the model is rich (large) enough to include the true function, or a close approximation to it.
Having fitted the model, an important extra diagnostic step is to check whether the basis size(s) you specified were large enough. The k.check() function in {mgcv} provides one such heuristic test; it looks for extra structure in the residuals when they are ordered by the covariate(s) of the individual smooths. An alternative approach is simply to take the deviance residuals from the model and fit a constant-variance, identity-link model with those residuals as the response and a smooth of the same covariates involved in the smooth you wish to test, but with double the k used for that smooth. If any of the smooths in this new model show significant wiggliness, that's a good sign the k you used in the original fit was too low, so you can go back, double k, and refit your original model. You may require several rounds of this if you guessed low on many of the ks initially.
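A sketch of both checks, assuming the original model m used s(x, k = 10) (names and values are placeholders):

library(mgcv)
k.check(m)   # heuristic basis-size test, also reported by gam.check()
## manual check: refit the deviance residuals with double the original k,
## assuming no rows of `data` were dropped when fitting m
chk   <- transform(data, r = residuals(m, type = "deviance"))
m_chk <- gam(r ~ s(x, k = 20), data = chk, method = "REML")
summary(m_chk)   # significant wiggliness in s(x) suggests the original k was too low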
You don't need to set k on the random effect smooths (bs = "re"). You are best leaving this at the default (IIRC it is ignored anyway) for this basis. The ridge penalty on these effects will perform the required shrinkage of the model coefficients to give you the same thing as estimates (posterior modes/means) of the random effects.
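For example (placeholder names again), the k argument is simply omitted for the "re" terms:

library(mgcv)
## Subject and Item must be factors; no k is needed for the "re" basis
m_re <- gam(y ~ s(x) + s(Subject, bs = "re") + s(Item, bs = "re"),
            data = dat, method = "REML")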
Q3
Once you as the user have set k for each of the (marginal) smooths in the model, where k is the expected upper limit (plus a bit) on the wiggliness of the functional effect on the response, a wiggliness penalty is created for each smooth. This penalty is used to avoid overly wiggly, i.e. overfitted, estimated smooths. The parameters for the basis functions of each smooth, any fixed effects, and the smoothing parameters are chosen to maximise the penalised log-likelihood of the data. The smoothing parameters are what control how much penalty we pay for having a wiggly function. The penalised log-likelihood trades off fit for generality; we avoid overfitting because, all else equal, we prefer smooth(er) functions to complex (wiggly) ones.
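In notation of my own (not from your model), the criterion being maximised has the general form

$$\ell_p(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \tfrac{1}{2}\sum_j \lambda_j \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S}_j \boldsymbol{\beta}$$

where $\ell$ is the log-likelihood, $\mathbf{S}_j$ is the penalty matrix measuring the wiggliness of the $j$th smooth, and $\lambda_j$ is its smoothing parameter: the larger $\lambda_j$, the heavier the price paid for wiggliness and the smoother the resulting estimate.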
When fitting with method = "REML" or method = "ML" (neither is the default, but you should certainly use one of them over the default), the model you are fitting is an empirical Bayesian one, with (improper) Gaussian priors on the coefficients. These priors encode the same idea: all else equal, we favour smoother functions over wigglier ones.
Q4
The smoothing parameters are either estimated or fixed at values you specify via the sp argument. You can also use fx = TRUE in one or more smooths to leave that particular smooth unpenalised, in which case it uses k degrees of freedom (minus something for the identifiability constraints).
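A sketch of both options (names and values are purely illustrative):

library(mgcv)
## fix the smoothing parameters for the two smooths instead of estimating them
m_sp <- gam(y ~ s(x, k = 20) + s(z, k = 20), sp = c(0.5, 2), data = dat, method = "REML")
## leave one smooth unpenalised; it then uses (close to) k degrees of freedom
m_fx <- gam(y ~ s(x, k = 20, fx = TRUE) + s(z), data = dat, method = "REML")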
Invariably you don't need to do anything to the smoothing parameters and can just let the model estimate them. The main things that control the resulting wiggliness are the values of k that you specified. The smoothing parameters just control how much penalty we pay for having a wiggly function. Hence you should think in terms of the wiggliness of your functions and the associated penalty (which is a function of k), rather than the smoothing parameters themselves.
In summary, you should think about the size of the basis you need for each smooth and not worry about smoothing parameters. Once we have a basis representation for the functions you want to estimate, the penalty will largely take care of the overfitting problem. How well all this works depends on all the other modelling decisions you made: have you got the right response distribution, link function, model terms, and so on.
As for your specific model, with the size of data you are (potentially) fitting, you might get better performance fitting with bam(), or, as you have a binomial GAMM, via gamm4::gamm4(), as the latter is more efficient when it comes to fitting random effects. You would have to change the model structure though, as you don't want to use the random effect splines in the gamm4::gamm4() model:
m <- gamm4::gamm4(Outcome ~ s(Control_1)       # control vars
                  + s(Control_2)
                  + s(X1)                       # main effects
                  + s(X2)
                  + s(X3)
                  + s(Main)                     # main variable of interest
                  + ti(Main, X1, k = c(5, 5))   # tensor product interactions
                  + ti(Main, X2, k = c(5, 5))
                  + ti(Main, X3, k = c(5, 5)),
                  ## note: if gamm4 rejects the ti() terms (it does not support
                  ## te-type tensor smooths), t2() tensor products are the alternative
                  random = ~ (1 | Subject) + (1 | Item), # REs as an lme4-style formula
                  REML = TRUE,                  # restricted ML
                  data = data,
                  family = binomial)            # logistic