
I am trying to understand the application of the Chi-Squared test for independence between predictor and response variables as it applies to feature selection in machine learning and exploratory data analysis. I understand that there are a few types of Chi-Squared test for independence.

However, I do not understand what exactly the relationship is that the Chi-Squared test measures. Is it simply a measure of correlation between the distributions of categorical variables?

I would prefer an intuitive explanation over a mathematical proof.

Thank you!

verkter
  • Just to be clear -- are you primarily interested in the 2x2 case or the general $r \times c$ case, or something else? If the $r \times c$ case, what do you mean by "correlation"? – Glen_b Dec 19 '16 at 05:54
  • Correlation as a linear relationship between two variables. I am not sure what you mean by 2x2, I am interested in getting a general understanding. – verkter Dec 19 '16 at 06:26
  • A chi-squared test for independence is conducted on data that falls into two (or more) categorical variables. How are you defining "linear" between things falling into categories? – Glen_b Dec 19 '16 at 11:58
  • Linear is probably not the correct term. I assumed that this is the relationship that the Chi-Squared test is measuring. Looking at the definition of what the chi-squared test for independence does: "It is used to determine whether there is a significant association between the two variables." (http://stattrek.com/chi-square-test/independence.aspx?Tutorial=AP) What actually is this "significant association"? – verkter Dec 20 '16 at 00:13
  • To return to the question about 2x2 vs r x c (since it impacts the possible ways of interpreting an idea of linear association)... how many categories do you have in each variable? – Glen_b Dec 20 '16 at 00:16
  • I don't have a specific example for you. I am mostly interested in its application and use in feature selection, or in how variables can be related to each other. You can be very general and high level. I don't understand what you mean by "2x2" and "r x c". Thanks. – verkter Dec 20 '16 at 00:27
  • Categorical variables are usually displayed in a table of counts. The first number in the product is the number of rows in the table (number of levels of the categorical row-variable) and the second number is the number of columns in the table. So if you have two binary variables in your chi-square, you display them as a 2x2 table, showing the counts in each combination. Go here, scroll down to "Finding Expected Counts from Observed Counts" and you'll see such a table (labelled "Observed table"). ... ctd – Glen_b Dec 20 '16 at 01:02
  • ctd... In that case it has two rows and three columns (not counting headings or totals), and so is a 2 x 3 table. – Glen_b Dec 20 '16 at 01:08
  • Let's say that it is larger than 2 x 2. Thanks. – verkter Dec 20 '16 at 17:48
  • Any more thoughts on this? Thanks – verkter Dec 24 '16 at 20:55
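
To make the table-of-counts idea from the comments concrete, here is a minimal Python sketch of a 2 x 3 table. The counts are made up purely for illustration, and the helper names (`expected_counts`, `chi_squared_statistic`) are hypothetical, not taken from any library. It computes the counts you would expect in each cell if the row variable and the column variable were independent, then sums up how far the observed counts fall from them. That distance from the independence pattern is roughly what "association" refers to here; it is not a linear correlation.

```python
# Illustrative 2 x 3 table of observed counts (made-up numbers).
# Rows: two levels of one categorical variable.
# Columns: three levels of another categorical variable.
observed = [
    [20, 30, 50],
    [30, 30, 40],
]


def expected_counts(table):
    """Counts expected in each cell if rows and columns were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Under independence: expected = row total * column total / grand total.
    return [
        [r * c / grand_total for c in col_totals]
        for r in row_totals
    ]


def chi_squared_statistic(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_counts(observed)
statistic = chi_squared_statistic(observed)
# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected counts:", expected)
print("chi-squared statistic:", round(statistic, 4))
print("degrees of freedom:", dof)
```

A statistic near zero means the observed table looks like the independence pattern; larger values mean the cells deviate from it. Turning the statistic into a p-value requires comparing it with a chi-squared distribution with `dof` degrees of freedom, which this sketch leaves out.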

0 Answers