Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: read_sas, to_sas #4052

Closed
benjello opened this issue Jun 27, 2013 · 82 comments
Closed

ENH: read_sas, to_sas #4052

benjello opened this issue Jun 27, 2013 · 82 comments
Labels
Enhancement IO Data IO issues that don't fit into a more specific label IO SAS SAS: read_sas
Milestone

Comments

@benjello
Copy link
Contributor

It would be really convenient to be able to at least import SAS tables into pandas dataframe. Is this planned ? Are they insurmountable issues ?

Thanks

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

can you post a link to the format? and see if any converters have been writtenin python?

obviously the idea would be to read the native format file

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

so looks like simple binary read/write stuff....could be done....only other question is there any license issue with doing this?

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

not sure. @benjello want to contact SAS and ask them?

@miketkelly
Copy link

That technote is describing the XPORT format, which isn't the native binary format. I've never seen a published layout of the native format. Some people have partially reverse engineered it, but I've never seen a solution that could handle data sets with compression.

https://github.com/BioStatMatt/sas7bdat

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

@mtkni thanks had no idea.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

aside from using SAS to actually export (e.g. csv or whatever), is there aformat that one could save that provides some interoperbility (and is openish)?

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

a 10 minute search suggests no but maybe someone else knows more.

@benjello
Copy link
Contributor Author

I am not a specialist much more a potential user in heavy need of such a tool.
For now I have used alternatively StatTransfer which is not a free software
or when importing to R I used one of the method exposed here but you need to have sas installed.
I am sorry for not being able to provide ypu for more information than above.

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

probably could do

  1. call to sas if exists
  2. if doesn't exist can only read xport so try to read that (someone would need to implement the xport reader)
  3. if not 1 or 2 then fail saying u need sas to read sas binary files.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

I don't think 1) is a good idea

export data in stata to xport format or csv format

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

ok. just throwing it out there. i don't like calling out to other programs either, but this seems like it's going to be tough. i can implement the R code above...if u think that's a good idea...but it basically forces users to use that particular version of the format and if it ever changes we won't know until it breaks.

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

i wouldn't be able to test to_sas though since i don't have sas

@benjello
Copy link
Contributor Author

I would be glad to test everything that would do the job. I have sas.

@miketkelly
Copy link

I've spent some time on this in the past. These are my thoughts:

  1. Reverse engineering the binary data set format is a difficult task and not a good priority for this project. Any solution, like the R solution, that doesn't deal with compressed data sets won't do me any good. I can't speak for others on that. I also worry about the licensing issues.
  2. Given that, the only option for read_sas may be to use the sas executable to first export SAS to a format that can be read by Python.
  3. The SAS transport format, as described in the tech note, is an option but it has some quirks. In particular, the native XPORT engine doesn't handle long variable names. It really hasn't evolved since SAS 6 (we're on SAS 9 now). The SAS workaround for long variable names is a set of macros that not everyone has installed.
  4. I'm not convinced the XPORT format is a better option than just CSV. If it's dramatically faster it might be. I'll study that next week.
  5. If the mechanism for reading SAS data sets turns out to be export to CSV and then using read_csv, then there's not much code to write that isn't SAS environment dependent. It just doesn't make for a good shared module.

As much as I wish there was a good solution to this, and as much as I'd be willing to help build it, I just don't think there is. I've built read_sas using CSV as an intermediate format. Obviously, this requires a SAS license. It takes very few lines to implement, but most of those lines are specific to our SAS environment and are not well-portable.

I will study the performance of XPORT vs CSV next week. If it's dramatically faster, then it may be worth the effort to implement. Even then, I'm not sure it's worth taking that on as part of the Pandas project. I would be interested in comments from other SAS users on that.

Just my two cents.

@cpcloud
Copy link
Member

cpcloud commented Jun 29, 2013

nice to hear from someone who tried to do this. FWIW i think it might be tough to beat CSV for speed, most of it is written C/Cython.

@miketkelly
Copy link

Agreed. The new, fast CSV was a game changer.

@jreback
Copy link
Contributor

jreback commented Jun 29, 2013

can sas export to HDF5?

@miketkelly
Copy link

No, it can't.

@jreback
Copy link
Contributor

jreback commented Jun 29, 2013

export in STATA format?

@miketkelly
Copy link

No, and if it did it would be an expensive add-on module. SAS is pretty good at reading from databases (http://www.sas.com/resources/factsheet/sas-access-factsheet.pdf), although each database platform is a separate license. I haven't found it good at all at writing to databases (it can, but it's slow). Other than that, interoperability doesn't seem to be part of their business model.

@miketkelly
Copy link

Oh wait, I may have spoken to soon. Apparently I can export to a stata file: http://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003102702.htm

Is STATA supported in Pandas? It would still requires a SAS license, but I can benchmark that versus CSV.

@benjello
Copy link
Contributor Author

There is a read_stata that will be available in te coming version but already available on github

@benjello
Copy link
Contributor Author

BTW, @mtkni I would be happy to look at the read_sas you implemented if you would share it ...

@miketkelly
Copy link

Just to close the loop on this, exporting to STATA requires an add-on for which I'm not licensed, so I can't benchmark it.

@dramage1
Copy link

dramage1 commented Nov 3, 2013

FYI, the XPT or transport format is a non-proprietary format that has no licensing issues and is the only format currently accepted by the Food and Drug Administration (FDA) for clinical trial data. Most pharmaceutical companies submit XPT format to the FDA. It would be nice to have a way to read these files in just like a csv file.

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

@dramage1 Is "XPT" the same as "XPORT" above?

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

Heyo - there's at least one Python package for reading XPT files - https://pypi.python.org/pypi/xport/0.1.0

@dramage1
Copy link

dramage1 commented Nov 3, 2013

Just what I needed, much appreciated.

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

@dramage1 if you use this enough to want to write up a pandas wrapper for it, that could be a useful addition to pandas (depending on the stability of xport)

@tyler-abbot
Copy link

@jreback I'm actually working on transcribing the SAScii (http://cran.r-project.org/web/packages/SAScii/index.html) package from R to Python. I would be happy to share the results if it is ok with the author of that package. I haven't done much development, though, so don't know much about sop's. I'm also not sure how compatible it would be with the library you mentioned. Perhaps just an add on with the option of ...format="sas"... or something along those lines.

@jreback
Copy link
Contributor

jreback commented Jun 18, 2015

just saw this. https://pypi.python.org/pypi/sas7bdat/2.0.1. Even if this is pure-python (slower), that is ok to start. Better to have it able to read than not.

cc @kshedden

@benjello
Copy link
Contributor Author

@jreback @kshedden I use extensively https://pypi.python.org/pypi/sas7bdat/2.0.1 It is slow but works well.
Since usually I transfer my data to HDF format once at the beginning of a study, it was worth it.

@tyler-abbot
Copy link

So, I don't think the sas7bdat package can read the type of sas files I'm talking about. I have finished writing a function to do it, but am going out of town for a few months. It is all contained in this package:

https://pypi.python.org/pypi/psid_py

I have a few days during which I could work on incorporating the read_sas() function into pandas. I'm going to read through the documentation and do some more testing, but if anyone has suggestions that will help me move more quickly it would be greatly appreciated.

@jreback
Copy link
Contributor

jreback commented Sep 10, 2015

I am going to mark this for 0.17.1. The implementation for using sas7bdat is quite trivial. So should start with that.

@jreback jreback modified the milestones: 0.17.1, Someday Sep 10, 2015
@aflaxman
Copy link
Contributor

I used sas7bdat the other day... here is what it took:

from sas7bdat import SAS7BDAT
with SAS7BDAT('/homes/abie/projects/2015/TICS/tics_07.sas7bdat') as f:
    df = f.to_data_frame()

There is some useful information hidden in the sas file that does not make it into the dataframe, though, such as the column labels.

@jreback
Copy link
Contributor

jreback commented Sep 10, 2015

agreed. all that is really needed are:

  • docs (e.g. sas7bdat will now be an optional import)
  • integration into the pd.read_sas method via the format kw
  • tests (and add to the CI so we can actually tests this), and some sample files
  • verify the dtypes are returned correctly (and if now fix / post fixes upstream)

@jreback jreback modified the milestones: 0.17.1, Next Major Release Oct 27, 2015
@jreback jreback modified the milestones: 0.18.0, Next Major Release Feb 8, 2016
@selik
Copy link
Contributor

selik commented Mar 23, 2016

In case you're curious, I revised the xport module and I think the code is a bit easier to read now. The xport.to_dataframe I wrote is occasionally 2x faster than pandas.read_sas but sometimes much slower. I expect it is dependent on the number of floats in the dataset, as I didn't vectorize the conversion from IBM to IEEE.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2016

@selik well certainly welcome a PR to fixup the pandas versions!

note that this could be MUCH faster if it were cythonized, similar to how #12656 is done.

@selik
Copy link
Contributor

selik commented Mar 23, 2016

@jreback I'd want to change the API. So long as pandas.read_sas behaves the same, is there room to change the behavior of things like XportReader?

@jreback
Copy link
Contributor

jreback commented Mar 23, 2016

@selik what do you need to change? the user API is very simple actually, just pd.read_sas(..., format='....') and chunking. As long as you don't mess with that prob ok.

@selik
Copy link
Contributor

selik commented Mar 23, 2016

@jreback Sounds good. Not sure what I might need to change internally, but the effort is more pleasant if there's more freedom. I'd say I'll get around to it soon, but looking back I suddenly realize it took me 3+ years from the first time I told myself I'd revise the XPORT reader.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2016

sure feel free to take a look around

@selik
Copy link
Contributor

selik commented Aug 25, 2016

Just FYI, I added dump and dumps to the xport module, in case anyone wants to take a look for writing pandas.to_sas.
https://github.com/selik/xport/blob/master/xport.py#L540

The conversion from Python floats to IBM-mainframe 64-bit floats seems to be working quite well, very rarely losing precision. At least when I round-trip from IEEE to IBM and back to IEEE.
https://github.com/selik/xport/blob/master/xport.py#L481

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Data IO issues that don't fit into a more specific label IO SAS SAS: read_sas
Projects
None yet
Development

Successfully merging a pull request may close this issue.