Is there a Python module to open SPSS files?

Question

Is there a module for Python to open IBM SPSS (i.e. .sav) files? It would be great if there's something up-to-date which doesn't require any additional dll files/libraries.

possible duplicate of [Exporting to SPSS files in Python Django?](http://stackoverflow.com/questions/6463763/exporting-to-spss-files-in-python-django) If you want there is also a [recipe](http://code.activestate.com/recipes/577650-python-reader-for-spss-sav-files/) on active-state — Bakuriu, Feb 01 '13 at 13:09
Hi, Bakuriu. It's not a duplicate, as I'm not referencing the Django framework, I'm talking about opening, as opposed to exporting/writing a file, and I mentioned the preference for something recent which doesn't require external libraries/dlls. There's some common ground between the questions, but they can elicit different, as well as similar, responses. Thanks for the link, but again, I'm trying to avoid dll files, if possible. — Lamps1829, Feb 01 '13 at 17:23
The other answer cites Django, but it actually has nothing to do with it. Since exporting requires the ability to write a file, the chances that you can also read it are high. Reading around I strongly believe you have only one choice: use the `.dll` released by IBM. I can't find any open specification for that file format, which means that the only way to read those files is to use IBM's libraries. You can always try to reverse-engineer the format, but that would take much more time and effort. — Bakuriu, Feb 02 '13 at 08:14
Thanks, Bakuriu. It's unfortunate, but as you said, it is looking likely that IBM's .dll release is the thing to use. — Lamps1829, Feb 02 '13 at 12:53

Otto Fajardo · Answer 1 · 2018-08-28T10:28:23.957


I have released a Python package, "pyreadstat", that reads SPSS (sav, zsav and por), Stata and SAS files. It is a wrapper around the C library ReadStat, so it is very fast. ReadStat is the library behind the R package haven, which is widely used and very robust.

The package is self-contained. It does not require R (no need to install an additional application) and it does not depend on IBM dlls or other external libraries.

For example, to read an SPSS sav file you would do:

import pyreadstat

df, meta = pyreadstat.read_sav("/path/to/sav/file.sav")

df is a pandas DataFrame; meta contains metadata such as variable labels and value labels. read_sav reads both sav and zsav (compressed) files. There is also a function, read_por, for old por (portable) files.

You can find it here: https://github.com/Roche/pyreadstat

edited Aug 28 '18 at 10:28

answered Aug 22 '18 at 10:39

Otto Fajardo


That is why I love Python. Messed around a lot of places, finally thought lets use Python. And it worked the first time. Thanks. – mradul dubey Jun 13 '21 at 12:58
fantastic library with amazing performance! thanks a lot. – Martin Bucher Jun 22 '21 at 13:40
I am late to this, however I would like to add my two cents. For me, I am working on a remote server, and I occasionally end up breaking things when trying to pip install new packages, etc., so there was the `readsav` function in `scipy.io` that worked for me, and was already included. The other top answer of using `pandas.rpy.common` didn't work for me either, as apparently that wasn't an attribute that `rpy` included. – Steven Thomas Nov 08 '21 at 16:09
@StevenThomas notice however that scipy.io.readsav reads an IDL sav file, not an SPSS sav file (the topic of this thread). IDL is a completely different programming environment. Pyreadstat does not read IDL files, only SPSS. – Otto Fajardo Nov 09 '21 at 10:38
@OttoFajardo thanks for the clarification, I didn't realise there were different types of .sav files. Perhaps I should remove my comment? – Steven Thomas Nov 09 '21 at 11:41
many people get confused by that, so let's leave the comment, I would say – Otto Fajardo Nov 09 '21 at 14:06
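Since the two formats share an extension, one way to tell them apart before picking a reader is to look at the first bytes of the file. The sketch below is not part of either library; the function name is hypothetical, and the signatures are assumptions based on my understanding of the formats (SPSS system files starting with `$FL2`, compressed zsav files with `$FL3`, and IDL save files with `SR`).

```python
import os
import tempfile


def sniff_sav_kind(path):
    """Guess what kind of .sav file this is from its leading bytes.

    Hypothetical helper; the magic values are assumptions noted above.
    """
    with open(path, "rb") as fh:
        head = fh.read(4)
    if head == b"$FL2":
        return "spss-sav"    # read with pyreadstat.read_sav
    if head == b"$FL3":
        return "spss-zsav"   # also handled by pyreadstat.read_sav
    if head[:2] == b"SR":
        return "idl-save"    # read with scipy.io.readsav
    return "unknown"


# Demo on two fake files that contain only a header.
demo_dir = tempfile.mkdtemp()
results = {}
for name, header in [("spss.sav", b"$FL2@(#) "), ("idl.sav", b"SR\x00\x04")]:
    file_path = os.path.join(demo_dir, name)
    with open(file_path, "wb") as fh:
        fh.write(header)
    results[name] = sniff_sav_kind(file_path)
print(results)
```

A check like this only looks at the signature, so a truncated or corrupt file can still fail later in the actual reader.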

score 11 · Answer 2 · edited May 23 '17 at 12:17

Depending on what you want to do--process data using R-related commands from rpy2, or switch to Python--the solution provided by @Spacedman on a related thread might easily be adapted to suit your needs.

Otherwise, Pandas includes a convenient wrapper for rpy2. Here is an example of use with Peat and Barton's weights.sav data set:

>>> import pandas.rpy.common as com
>>> filename = "weights.sav"
>>> w = com.robj.r('foreign::read.spss("%s", to.data.frame=TRUE)' % filename)
>>> w = com.convert_robj(w)
>>> w.head()
     ID  WEIGHT  LENGTH  HEADC  GENDER  EDUCATIO              PARITY
1  L001    3.95    55.5   37.5  Female  tertiary  3 or more siblings
2  L003    4.63    57.0   38.5  Female  tertiary           Singleton
3  L004    4.75    56.0   38.5    Male    year12          2 siblings
4  L005    3.92    56.0   39.0    Male  tertiary         One sibling
5  L006    4.56    55.0   39.5    Male    year10          2 siblings

score 9 · Answer 3 · answered Jun 26 '15 at 12:19


As a note for people finding this later (like me): pandas.rpy has been deprecated in the newest versions of pandas (>0.16) as noted here. That page includes information on updating code to use the rpy2 interface.


Savage Henry


Thanks for sharing this. So `com.convert_robj(rdf)` should be replaced with `pandas2ri.ri2py(rdf)`. But what about `com.robj.r('foreign::read.spss("%s", to.data.frame=TRUE)' % filename)`? – Pyderman Mar 28 '16 at 20:42

Sander van den Oord · Answer 4 · 2020-03-06T17:54:59.063

When you have pandas >= 0.25.0 you can now finally just do pd.read_spss():

# you need pandas >= 0.25.0 for this    
import pandas as pd
df = pd.read_spss('your_spss_file.sav')

This requires the pyreadstat library, so you might have to install that first:

pip install pyreadstat

Extra info on the parameters of pd.read_spss():

Parameters
----------
path : string or Path
    File path

usecols : list-like, optional
    Return a subset of the columns. If None, return all columns.

convert_categoricals : bool, default is True
    Convert categorical columns into pd.Categorical.

Returns
-------
DataFrame

score 3 · Answer 5 · answered Feb 03 '13 at 04:52


But the benefit of using the IBM libraries is that they get this rather complex binary file format right. They are free, relieve you of the burden of writing code for this format, and the license permits you to redistribute them. What more could you ask?


JKP


I'd ask for ARM support :) – qdot Aug 28 '14 at 17:56
Where can we find the IBM libraries? – Tagar Nov 02 '16 at 15:15
You can get them by following the Downloads link on the IBM Predictive Analytics Community site (https://developer.ibm.com/predictiveanalytics/) – JKP Dec 04 '16 at 20:00

score 3 · Answer 6 · answered Sep 05 '16 at 08:56


Here are some packages you may be interested in:

savReaderWriter on Bitbucket
savReaderWriter 3.4.2 in Python Package Index Repo


4ilin


score 2 · Answer 7 · answered Nov 02 '16 at 17:17

I had the same question as @Pyderman about how to update this for pandas (>0.16). This is what I came up with:

from rpy2.robjects import pandas2ri, r
filename = 'weights.sav'
w = r('foreign::read.spss("%s", to.data.frame=TRUE)' % filename)
df = pandas2ri.ri2py(w)
df.head()

score 1 · Answer 8 · answered Feb 01 '13 at 13:14


Perhaps you may find this useful: http://code.activestate.com/recipes/577811-python-reader-writer-for-spss-sav-files-linux-mac-/


SpankMe


Thanks, SM, but that module requires an additional dll file, and that's something I'm trying to avoid. Is there a module (preferably one that is recent) that contains all of the necessary functionality without the use of external libraries? – Lamps1829 Feb 01 '13 at 13:28
Not one I'd know about or was able to find using Google, sorry. Why is using an external library something you can't live with? I suppose you use a lot of them on a daily basis, whether it's Python or anything else, including the OS. – SpankMe Feb 01 '13 at 13:31
I wouldn't exclude the possibility of using a dll if other options are exhausted, but it's something I'd like to avoid if possible. The fewer dependencies, the cleaner things are, and the lower the chances of things going wrong. – Lamps1829 Feb 01 '13 at 13:43

And the less likely to get it right, Lamps1829. The i/o modules made freely available by IBM for all the platforms that SPSS Statistics runs on use the same code that Statistics itself uses, so they are guaranteed to be in sync. And the Python reader/writer utilities mentioned above also use these libraries. Those libraries get updated as new features are added to the sav file format, too. The R libraries, last time I looked, did not get everything right. – JKP Aug 29 '14 at 19:53

score 0 · Answer 9 · edited May 23 '17 at 11:54


You could use a Python interface to R and then import the data using read.spss from library(foreign).


answered Apr 15 '13 at 10:40

Jeromy Anglim
