4

What are the advantages of using a log-linear representation rather than a table representation? Is it simply a computational issue (avoiding overflow)?

For example, in a Markov network A–B we can represent the factor $\phi(A,B)$ as a table (the entries are unnormalized potentials, not probabilities):

A  B  φ(A,B)
0  0  10
0  1   1
1  0   1
1  1  10

Alternatively, we can represent the factor $\phi(A,B)$ as a log-linear model:

$$\phi(A,B) = \exp\bigg(\sum\limits_{i=1}^4\theta_i f_i(A,B)\bigg)$$

Here each $f_i$ is an indicator function for one joint assignment of $(A,B)$, so each $\theta_i$ is simply the log of the corresponding entry in the table representation. What would be the advantages of the log-linear representation here?
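To see the equivalence concretely, here is a minimal NumPy sketch (the variable names are my own, not from the question) that sets $\theta_i = \log$ of each table entry and checks that the log-linear form reproduces the table:

```python
import numpy as np

# Table representation of the factor phi(A, B), one entry per assignment.
table = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}

# Log-linear representation: one indicator feature per assignment,
# with theta_i = log of the corresponding table entry.
theta = {assignment: np.log(value) for assignment, value in table.items()}

def phi(a, b):
    """Evaluate the factor via exp(sum_i theta_i * f_i(a, b))."""
    # Exactly one indicator fires for (a, b), so the sum collapses
    # to the single matching theta_i.
    weighted_sum = sum(t for s, t in theta.items() if s == (a, b))
    return np.exp(weighted_sum)

# Both representations agree on every assignment.
for (a, b), value in table.items():
    assert np.isclose(phi(a, b), value)
```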

Dzung Nguyen

2 Answers

1

There is a different literature supporting the use of log-linear models that begins with Bishop et al., Discrete Multivariate Analysis (1975). It extends through Leo Goodman's RC models beginning in the 1980s, Agresti's Categorical Data Analysis, books by Stephen Fienberg, and Wickens's excellent Multiway Contingency Tables Analysis for the Social Sciences (1989). Needless to say, these approaches are all appropriate for frequency, "count", or classificatory data.

The example given above is a simple 2×2 table, and there may be few advantages to using log-linear models in this case, since a sophisticated analysis isn't needed. One big advantage of the log-linear framework is the flexibility it offers for testing different table structures in dimensions higher than 2×2, structures that distinguish, e.g., independence on the diagonal (the classic chi-square test) from conditional independence, as a function of how you slice the table up. In addition, and beyond chi-squares, odds ratios are readily estimable as more suitable metrics of effect size.
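To make this concrete, here is a minimal sketch of that style of analysis in Python with statsmodels (the data-frame layout and variable names are my own), fitting the question's 2×2 counts as a Poisson log-linear model: the residual deviance of the independence model is the likelihood-ratio ($G^2$) test of independence, and the interaction coefficient of the saturated model is the log odds ratio.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# The 2x2 table from the question, flattened to one row per cell.
df = pd.DataFrame({
    'A': ['0', '0', '1', '1'],
    'B': ['0', '1', '0', '1'],
    'count': [10, 1, 1, 10],
})

# Independence model: log(mu) = intercept + A + B (no interaction).
indep = smf.glm('count ~ A + B', data=df,
                family=sm.families.Poisson()).fit()

# Its residual deviance is the likelihood-ratio G^2 statistic against
# the saturated model, i.e. a test of independence of A and B.
print('G^2 =', indep.deviance, 'on', indep.df_resid, 'df')

# In the saturated model the A:B interaction coefficient is the
# log odds ratio of the table: log(10*10 / (1*1)) ~ 4.61.
sat = smf.glm('count ~ A * B', data=df,
              family=sm.families.Poisson()).fit()
print('log odds ratio =', sat.params['A[T.1]:B[T.1]'])
```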

Clearly, there is more than one way to analyze frequency data. How one chooses to do it is a function of one's training and comfort level.

user78229
0

Most textbooks and slides I found just state that it's "common" or "convenient" to do so but don't explain why.

I've found two reasons that apply to Markov Networks:

  1. Exponentiating the weighted features makes sure that all factor values are greater than zero, and normalizing with the partition function Z makes sure that they sum to one. This way we get a valid probability distribution. This advantage is explained in this Coursera course (you have to register first).

  2. We take advantage of the fact that the exponential function is its own derivative. This makes it much easier to compute the gradient when learning the weights, e.g. with gradient descent (a sketch covering both points follows this list).
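Here is a minimal NumPy sketch of both points, assuming one indicator feature per joint assignment of $(A,B)$ as in the question (the function names and toy data are my own): exponentiating and dividing by $Z$ yields a valid distribution, and the derivative property gives the standard gradient $E_{\text{data}}[f] - E_{\text{model}}[f]$.

```python
import numpy as np

# One indicator feature per joint assignment of (A, B).
assignments = [(0, 0), (0, 1), (1, 0), (1, 1)]

def features(a, b):
    """One-hot feature vector: f_i(a, b) = 1 iff (a, b) is the i-th assignment."""
    return np.array([float((a, b) == s) for s in assignments])

def distribution(theta):
    """Point 1: exp() makes every value positive; dividing by Z makes them sum to 1."""
    unnorm = np.exp([theta @ features(a, b) for a, b in assignments])
    return unnorm / unnorm.sum()   # unnorm.sum() is the partition function Z

def grad_log_likelihood(theta, data):
    """Point 2: because d/dtheta exp(theta.f) = f * exp(theta.f), the gradient
    of the average log-likelihood reduces to E_data[f] - E_model[f]."""
    p = distribution(theta)
    e_model = sum(p_i * features(a, b) for p_i, (a, b) in zip(p, assignments))
    e_data = np.mean([features(a, b) for a, b in data], axis=0)
    return e_data - e_model

# Gradient ascent on the log-likelihood (= descent on the negative log-likelihood).
data = [(0, 0), (1, 1), (0, 0), (1, 1), (0, 1)]   # toy observations
theta = np.zeros(4)
for _ in range(500):
    theta += 0.5 * grad_log_likelihood(theta, data)
print(distribution(theta))   # approaches the empirical frequencies [0.4, 0.2, 0.0, 0.4]
```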

From a numerical point of view, it is also preferable to sum log-probabilities, because multiplying many small probabilities risks underflowing the computer's floating-point precision.
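A quick toy illustration of the underflow point:

```python
import numpy as np

p = np.full(1000, 1e-5)    # a thousand small probabilities (toy values)

print(np.prod(p))          # 0.0 -- the product underflows double precision
print(np.log(p).sum())     # -11512.9... -- log of the product, computed safely
```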

Suzana