9

I'm Stata-proficient but learning SPSS for my new position. I am using a simple dataset to do very basic regressions and comparing to see if the results are the same. They're not. I'm close, but the magnitudes of the betas and significance are slightly different. The data was copy and pasted into each from an Excel; I didn't use a Stata file in SPSS, or vice versa. For SPSS, I did not weight it, it's using listwise deletion, and it's on the "enter" method. I presume Stata is doing the same, as its default (but correct me if I'm wrong and it's a different default!).

Any ideas on what else to check? I'm doing just a simple linear regression.

Data: https://www.dropbox.com/s/8g31cjf8vr69i44/rwj%20county%20data.xls?dl=0

Syntax

SPSS

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF
/DEPENDENT FreeLunch
/METHOD=ENTER FoodInsecure Rural Female @18 Hispanic.

(Or, for point and click, analyze->regression->linear; forces a choice under "method" for enter/stepwise/remove/backward/forward.)

Stata

reg percentfreelunch percentfoodinsecure rural female under18 hispanic

Data is in Excel and was pasted into both.

Results

SPSS

           Var. |   Unst.B | Std.Err. |   St.B |     t | Sig.
     (Constant) | -139.616 |   66.652 | -2.095 | .045
% Food Insecure |    2.785 |     .674 |   .546 | 4.131 | .000
          Rural |     .131 |     .048 |   .404 | 2.701 | .011
         Female |    2.657 |    1.170 |   .372 | 2.270 | .031
           < 18 |    -.416 |     .583 |  -.145 | -.715 | .480
       Hispanic |    1.156 |     .236 |  1.092 | 4.905 | .000

Stata

               Var. |     Coef. | Std.Err. |     t | P>|t|
percentfoodinsecure |   2.76532 | .6741544 |  4.10 | 0.000
              rural |  .1378976 | .0495354 |  2.78 | 0.009
             female |  2.826711 | 1.204272 |  2.35 | 0.026
            under18 | -.3799895 |  .588423 | -0.65 | 0.523
           hispanic |  1.168375 | .2398765 |  4.87 | 0.000
              _cons | -149.3858 |  69.0891 | -2.16 | 0.039
ShannonC
  • 604
  • 4
    "Simple linear regression" does not entail an "enter" method so far as I know. Are you doing something stepwise? How are the data held? In storage types that are exactly comparable? Regardless of that, this is perilously close to "I get slightly different results in different programs". Can you provide reproducible examples with specified datasets and exact commands so that people can replicate results in either program, or indeed any other program? – Nick Cox Aug 26 '14 at 16:52
  • Thanks. Those who know both programs (not me; I've not used SPSS this century) may be able to comment. – Nick Cox Aug 26 '14 at 17:05
  • Shoot, that came out much better in the editor. I'll work on making these a little more readable...

    And thank you, @Nick Cox!

    – ShannonC Aug 26 '14 at 17:19
  • Stata is consistent with itself, then. Still, that isn't the question. I stand aside here: as said, I don't use SPSS. – Nick Cox Aug 26 '14 at 17:49
  • 4
    For what it is worth, R is producing results like Stata. – russellpierce Aug 26 '14 at 17:58
  • 4
    From what you've pasted, it's the same analysis (I've used both SPSS and Stata). It's almost certain that the data are different. Run things like means, frequencies or correlations and make sure the results are the same in both. It's possible that something weird happened with variable names when you pasted. Save the data from Stata (in Stata format) and open it in SPSS. – Jeremy Miles Aug 26 '14 at 18:01
  • 2
    It's possible that commands you've run previously on SPSS carry over - for e.g. weight, split file, or select. Check that. Also make sure that you haven't defined a value in the data as missing. – Jeremy Miles Aug 26 '14 at 18:05
  • Weird. I just ran it in Excel, because that's what the data download to, and I get the SPSS results. When I run it in R, I get the Stata results. – Jeremy Miles Aug 26 '14 at 18:12
  • One more thing. SPSS appears to be giving a standardized estimate for Hispanic of 1.092. That's not impossible, but it's unusual if your variables aren't very highly correlated. – Jeremy Miles Aug 26 '14 at 18:27
  • The highest correlation here (Stata again) is 0.79 between Hispanic and under18. – Nick Cox Aug 26 '14 at 18:37
  • 3
    I'm Stata-proficient but learning SPSS for my new position. My condolences – StasK Aug 26 '14 at 19:00
  • Thanks for the input! So I guess it's Stata and R vs. SPSS and Excel!

    Tried to save my Stata data for use in SPSS, and even using saveold, it's not creating an old enough version for SPSS, so I need to work that out. Does anyone know how to force-save it to an even older version?

    Been looking at all the SPSS settings; seem to be clean, but it's not my area of expertise. I could've missed something.

    Hispanic and under18 are correlated; this is just a sample reg from an old dataset to practice SPSS.

    Thanks for all the input! Going to keep trying.

    *edit: looks like @whuber has an answer!

    – ShannonC Aug 26 '14 at 20:02
  • If copy and paste into Stata are wrong, there is no point in exporting that over-rounded data anywhere else. What I did not try -- and evidently no one else tried either -- in Stata was using import excel rather than copy and paste. – Nick Cox Aug 26 '14 at 21:37
  • @Nick I believe the problem lies primarily with Excel, not Stata or SPSS or R. At least in older versions of Windows, only plain text format was placed in the Clipboard. Excel chose to copy the text as it was formatted rather than trying to find some universal representation for the data as stored (which is next to impossible anyway). In newer versions of Windows several data structures can be placed in the Clipboard, eliminating this problem, but I don't know the details for recent versions of Excel. – whuber Aug 27 '14 at 14:14
  • @whuber Thanks; that matches my understanding. My wording should perhaps have been: If the results of copying and pasting from Excel to any program are wrong, there is no point exporting those over-rounded results into a third program. – Nick Cox Aug 27 '14 at 14:18
  • 2
    @Nick I think there may be even subtler (and more universal) issues to contend with. What happened here also happens when people transcribe data into reports or papers: they get rounded and some precision is lost. If one wants to publish reproducible results, then (arguably) the analyses should be based on the data as published rather than the data as represented internally in the computer. Although one would hope the published data and computerized data are the same, they often won't be unless the data are made electronically accessible. – whuber Aug 27 '14 at 14:25
  • 2
    I agree, not least because even 0.1 may be held differently depending on the details of how a dataset is stored. – Nick Cox Aug 27 '14 at 14:31
  • The link to the data is dead. Please supply a more stable data format. – Glen_b Aug 11 '15 at 02:41

2 Answers2

16

The problem (amazingly) has to do with rounding the values during pasting.

In Excel, most of the values were computed elsewhere and are recorded as doubles (about 16 decimal places of precision). Only % Food Insecure actually is stored to a small number of decimal places (one). None of the data columns is stored as it appears in Excel. During pasting, the receiving application typically will accept the data as they appear, not as they are actually stored!

Rounding of data matters in this situation because for several variables--especially percent female and percent food insecurity--the amounts rounded off can be an appreciable fraction of the standard deviation of the data.

When I read the Excel data directly in R using xlsx::read.xlsx, I reproduce the SPSS results exactly. When I round the data to integers (for % Free Lunch) and to one decimal place for the others--as they appear when pasting them into R--I get new results, but the estimated coefficients change appreciably. For instance, the intercept of $-139.616$ becomes $-133.897$.

I have not been able to reproduce the Stata results in R (my summary statistics do not quite agree with those presented by Nick Cox: my mean for % Food insecure is $15.67$ instead of $15.81$), but I suspect that if I were to paste them into my copy of Stata, I would get the reported Stata results. (A big clue is the rounded values presented for the minima and maxima: in most cases these are not the minima and maxima actually recorded in the Excel file.)

The differences between the two sets of results are a small fraction of a standard error, so they are--in this statistical sense--of no consequence.

There is no collinearity problem: the VIFs are nice and low.

Moral

When you care about your data, read them directly: do not intervene manually via copy-and-paste or transcription.

whuber
  • 322,774
  • 3
    I can support it. I have no Stata, but when I read the Excel data in SPSS, the results were exactly as the OP gave. When I pasted the data into SPSS, the values got rounded to the number of digital places displayed in Excel; the results were very close to "Stata results". – ttnphns Aug 26 '14 at 19:32
  • 3
    ... and not just rounding of values during pasting. Saving the excel sheet as a csv in Excel saves the visible number of decimal points (the approach I used to get R to produce 'Stata results'). – russellpierce Aug 26 '14 at 19:40
  • 1
    YOU ARE AN ANGEL! Thank you! I will try to do it in both again with attention to that, but I bet that solves it! – ShannonC Aug 26 '14 at 20:04
  • 2
    Solved! Results=identical now. Never would've occurred to me that would be the source. Thank you, everyone! – ShannonC Aug 26 '14 at 20:24
  • 2
    It definitely was a group effort, as the comments and replies attest. – whuber Aug 26 '14 at 21:15
  • 1
    Absolutely. I can already tell this is a great place to be! I don't want to admit how many hours I spent trying to get this right before asking...again, thank you to all! – ShannonC Aug 26 '14 at 22:04
5

This isn't really an answer, but there is no easy way to show output except here.

This is what I got in Stata 13, after copy and paste and some renaming.

There are no missing values on the variables used, so what either program might do with missing values is irrelevant here.

2/6 variables are presented as integers; 4/6 are presented with one decimal place. Note that the exact route I followed has the consequence that variables with decimal places are held as double. Other routes might produce variables of float type.

Check that 36 observations (cases) are included in both regressions.

Please check whether you got the same.

. d percentfreelunch percentfoodinsecure rural female under18 hispanic

                 storage   display    value
   variable name   type    format     label      variable label
   -----------------------------------------------------------------------------------------
   percentfreelu~h byte    %10.0g                % Free Lunch
   percentfoodin~e byte    %10.0g                % Food Insecure
   rural           double  %10.0g                Rural
   female          double  %10.0g                Female
   under18         double  %10.0g                < 18
   hispanic        double  %10.0g                Hispanic


   . reg percentfreelunch percentfoodinsecure rural female under18 hispanic

        Source |       SS       df       MS              Number of obs =      36
  -------------+------------------------------           F(  5,    30) =    8.27
         Model |  1578.09245     5   315.61849           Prob > F      =  0.0001
      Residual |  1145.12977    30  38.1709925           R-squared     =  0.5795
  -------------+------------------------------           Adj R-squared =  0.5094
         Total |  2723.22222    35  77.8063492           Root MSE      =  6.1783

  -------------------------------------------------------------------------------------
     percentfreelunch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
  --------------------+----------------------------------------------------------------
  percentfoodinsecure |    2.76532   .6741544     4.10   0.000     1.388513    4.142127
                rural |   .1378976   .0495354     2.78   0.009     .0367327    .2390624
               female |    2.82671   1.204272     2.35   0.026     .3672593    5.286162
              under18 |  -.3799895    .588423    -0.65   0.523     -1.58171    .8217307
             hispanic |   1.168375   .2398765     4.87   0.000     .6784822    1.658269
                _cons |  -149.3857   69.08908    -2.16   0.039    -290.4845   -8.287025
  -------------------------------------------------------------------------------------

EDIT: Some summary statistics from Stata

. su percentfreelunch percentfoodinsecure rural female under18 hispanic, sep(0)

     Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
 percentfre~h |        36    46.27778    8.820791         29         72
 percentfoo~e |        36    15.80556    1.785968         12         20
        rural |        36    44.37222    27.25343        1.3        100
       female |        36    49.95833    1.233201       45.9       51.5
      under18 |        36    21.26389    3.069216       15.6         28
     hispanic |        36    10.86944    8.315178        2.3       32.2
Nick Cox
  • 56,404
  • 8
  • 127
  • 185
  • I checked and all cases are included in both models. I have the exact same Stata results as you. I changed all the variables to float type, and the results are the same. – ShannonC Aug 26 '14 at 17:46
  • 4
    When I import the data into SPSS none of the variables are integers. All of the ones used here are floating point except for the FoodInsecure - which is F3.1 (in Fortran style number formats). This causes the variables to all have slightly different summary statistics in SPSS compared to here. Here is SPSS code to replicate. So it is likely the two programs input the Excel data differently - which is correct I'm unsure. – Andy W Aug 26 '14 at 18:25
  • @AndyW I think you found the answer. SPSS got it right. – whuber Aug 26 '14 at 19:26
  • 1
    Thank you, @AndyW and @whuber! Got it working perfectly. :-) – ShannonC Aug 26 '14 at 20:22