I'm dealing with a Bayesian hierarchical linear model; here is the network describing it.
$Y$ represents daily sales of a product in a supermarket (observed).
$X$ is a known matrix of regressors, including prices, promotions, day of the week, weather, holidays.
$S$ is the unknown latent inventory level of each product, which causes the most problems. I treat it as a vector of binary variables, one per product, with $1$ indicating a stockout and therefore the unavailability of the product. Although it is unknown in theory, I estimated it through an HMM for each product, so it is to be considered as known, just like $X$. I only decided to unshade it in the graph for proper formalism.
$\eta$ is a mixed-effect parameter for each single product, where the mixed effects considered are the product's price, promotions and stockout.
$\beta$ is the vector of fixed regression coefficients, while $b_1$ and $b_2$ are the vectors of mixed-effect coefficients. One group indicates brand and the other flavour (this is an example; in reality I have many groups, but I report just two here for clarity).
$\Sigma_{\eta}$ , $\Sigma_{b_1}$ and $\Sigma_{b_2}$ are hyperparameters over the mixed effects.
Since I have count data, let's say I treat each product's sales as Poisson distributed conditional on the regressors (even if for some products the linear approximation holds and for others a zero-inflated model is better). In that case I would have, for a product $Y$ (this part is just for whoever is interested in the Bayesian model itself; skip to the question if you find it uninteresting or non-trivial :)):
$\Sigma_{\eta} \sim IW(\alpha_0,\gamma_0)$
$\Sigma_{b_1} \sim IW(\alpha_1,\gamma_1)$
$\Sigma_{b_2} \sim IW(\alpha_2,\gamma_2)$, $\alpha_0,\gamma_0,\alpha_1,\gamma_1,\alpha_2,\gamma_2$ known.
$\eta \sim N(\mathbf{0},\Sigma_{\eta})$
$b_1 \sim N(\mathbf{0},\Sigma_{b_1})$
$b_2 \sim N(\mathbf{0},\Sigma_{b_2})$
$\beta \sim N(\mathbf{0},\Sigma_{\beta})$, $\Sigma_{\beta}$ known.
$\lambda_{tijk} = \beta X_{ti} + \eta_i X_{pps_{ti}} + b_{1_j} Z_{tj} + b_{2_k} Z_{tk}$,
$Y_{tijk} \sim \mathrm{Poi}(\exp(\lambda_{tijk}))$
$i \in \{1,\dots,N\}$, $j \in \{1,\dots,m_1\}$, $k \in \{1,\dots,m_2\}$
$Z_i$ is the matrix of mixed effects for the two groups, and $X_{pps_i}$ indicates the price, promotion and stockout of the product considered. $IW$ denotes the inverse-Wishart distribution, commonly used as a prior for the covariance matrices of multivariate normals, but that's not important here. A possible $Z_i$ could be the matrix of all the prices, or we could even set $Z_i = X_i$. As regards the priors for the mixed-effects variance-covariance matrices, I would just try to preserve the correlation between the entries, so that $\sigma_{ij}$ is positive if $i$ and $j$ are products of the same brand or of the same flavour.
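To make the generative process above concrete, here is a minimal NumPy simulation of it. All dimensions and hyperparameter choices are illustrative assumptions (small $T$, $N$, $m_1$, $m_2$; a random SPD matrix stands in for the inverse-Wishart draws, and $Z$ is taken equal to $X_{pps}$ as suggested in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

T, N, m1, m2 = 50, 4, 2, 2  # days, products, brands, flavours (toy sizes)
p = 3   # fixed-effect regressors (e.g. day of week, weather, holiday)
q = 3   # mixed-effect regressors: price, promotion, stockout

def random_cov(dim, scale=0.1):
    """Stand-in for an inverse-Wishart draw: a random SPD matrix."""
    A = rng.normal(size=(dim, dim))
    return scale * (A @ A.T + np.eye(dim))

Sigma_eta, Sigma_b1, Sigma_b2 = random_cov(q), random_cov(q), random_cov(q)

beta = rng.multivariate_normal(np.zeros(p), 0.25 * np.eye(p))    # fixed effects
eta  = rng.multivariate_normal(np.zeros(q), Sigma_eta, size=N)   # per-product
b1   = rng.multivariate_normal(np.zeros(q), Sigma_b1, size=m1)   # per-brand
b2   = rng.multivariate_normal(np.zeros(q), Sigma_b2, size=m2)   # per-flavour

brand   = rng.integers(0, m1, size=N)   # group membership j(i) of product i
flavour = rng.integers(0, m2, size=N)   # group membership k(i) of product i

X     = rng.normal(size=(T, N, p))                        # generic regressors
X_pps = np.stack([rng.normal(size=(T, N)),                # price
                  rng.integers(0, 2, (T, N)),             # promotion
                  rng.integers(0, 2, (T, N))], -1).astype(float)  # stockout S

# lambda_{tijk} = beta.X_{ti} + eta_i.Xpps_{ti} + b1_{j(i)}.Z_{ti} + b2_{k(i)}.Z_{ti}
lam = (X @ beta
       + np.einsum('tiq,iq->ti', X_pps, eta)
       + np.einsum('tiq,iq->ti', X_pps, b1[brand])
       + np.einsum('tiq,iq->ti', X_pps, b2[flavour]))

Y = rng.poisson(np.exp(lam))   # Y_{tijk} ~ Poi(exp(lambda_{tijk})), shape (T, N)
```

This is only a sketch of the sampling direction of the model, useful for checking the likelihood before handing it to JAGS or Stan; inference over $\beta$, $\eta$, $b_1$, $b_2$ and the covariances is a separate matter.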
The intuition behind this model is that the sales of a given product depend on its price and its availability, but also on the prices and stockouts of all the other products. Since I don't want the same model (read: the same regression curve) for every product, I introduced mixed effects, which exploit some groups I have in my data through parameter sharing.
My questions are:
- Is there a way to transpose this model into a neural-network architecture? I know there are many questions on the relationships between Bayesian networks, Markov random fields, Bayesian hierarchical models and neural networks, but I didn't find anything going from a Bayesian hierarchical model to neural nets. I ask about neural networks because, given the high dimensionality of my problem (I have 340 products), parameter estimation through MCMC takes weeks (I tried with just 20 products, running parallel chains in runjags, and it took days). But I don't want to go in blind and just feed the data to a neural network as a black box. I would like to exploit the dependence/independence structure of my network.
Here I have sketched a neural network. As you can see, the regressors at the top ($P_i$ and $S_i$ indicate respectively the price and stockout of product $i$) are fed to the hidden layer, as are the product-specific ones (here I considered prices and stockouts). (Blue and black edges have no particular meaning; they were just to make the figure clearer.) Furthermore, $Y_1$ and $Y_2$ could be highly correlated while $Y_3$ could be a totally different product (think of two orange juices and a red wine), but I don't use this information in the neural network. I wonder whether the grouping information is used only in weight initialization, or whether one could customize the network to the problem.
Edit, my idea:
My idea would be something like this: as before, $Y_1$ and $Y_2$ are correlated products, while $Y_3$ is a totally different one. Knowing this a priori I do 2 things:
- I preallocate some neurons in the hidden layer for each group I have; in this case I have two groups: {($Y_1,Y_2$), ($Y_3$)}.
- I initialize high weights between the inputs and the allocated nodes (the bold edges), and of course I add other hidden nodes to capture the remaining 'randomness' in the data.
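One way to push the two points above beyond initialization is to hard-code the grouping as a sparsity mask on the input-to-hidden weights, so each group's preallocated hidden neurons only see that group's inputs, plus a shared block of neurons that sees everything. A minimal NumPy sketch, where the layer sizes, the mask layout, and the choice of two inputs ($P_i$, $S_i$) per product are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Inputs: price P_i and stockout S_i for 3 products -> 6 input features.
# Products 0 and 1 form one group (two orange juices), product 2 its own.
n_in = 6
groups = [[0, 1], [2]]            # product indices per group
neurons_per_group, shared = 4, 4  # preallocated + shared hidden units
n_hidden = neurons_per_group * len(groups) + shared

# Mask: group-specific hidden units connect only to their group's inputs
# (features 2*i and 2*i+1 belong to product i); shared units see everything.
mask = np.zeros((n_in, n_hidden))
for g, prods in enumerate(groups):
    cols = slice(g * neurons_per_group, (g + 1) * neurons_per_group)
    for i in prods:
        mask[2 * i: 2 * i + 2, cols] = 1.0
mask[:, -shared:] = 1.0

W1 = rng.normal(scale=0.5, size=(n_in, n_hidden)) * mask  # masked weights
W2 = rng.normal(scale=0.5, size=(n_hidden, 3))            # hidden -> Y_1..Y_3

def forward(x):
    # Re-applying the mask each pass means any training step would keep the
    # zeroed connections at zero, not just start them there.
    h = np.tanh(x @ (W1 * mask))
    return np.exp(h @ W2)   # exp link keeps predicted counts positive

x = rng.normal(size=n_in)
y_hat = forward(x)          # one prediction per product
```

The bold-edge idea from the question could be recovered by scaling up the surviving weights at initialization; the point of the mask is that the group structure {($Y_1,Y_2$), ($Y_3$)} then lives in the architecture itself rather than only in the starting values.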
Thank you in advance for your help.



For less complex models (for instance, not considering mixed effects over brand), with $\lambda_{itjk} = \boldsymbol{\beta} \mathbf{X}_t + \boldsymbol{\eta}_i \mathbf{Z}_{it}$ for each product, it works fine and I'm able to make good inference about what happens. When increasing complexity I have some convergence problems: I may be missing something in my models, but I think the data don't help either, since I have a lot of skewed predictors, missing data, et cetera. – Tommaso Guerrini Mar 28 '17 at 00:30
Sorry for asking a question not in the proper place. – Tommaso Guerrini Mar 28 '17 at 00:35
Thank you very much, Luigi, by the way. I'm in that situation where I have no more time to dig into the problems as I should, since I have an incoming deadline. Stan seems like a great tool, but the learning curve is a little steep to really appreciate its incredible performance (as of now I have noticed its speed-up with respect to JAGS). – Tommaso Guerrini Mar 28 '17 at 00:50