
ENH: SAS7BDAT reader #12015

Closed · wants to merge 1 commit (from kshedden's sas7bdat branch)

Conversation

@kshedden (Author) opened this pull request:

This needs more testing, but basically seems to work for reading uncompressed SAS7BDAT files. I will add support for compression in a few days.

closes #4052

@jreback commented Jan 11, 2016

nice! I would put this in a subdir

io/sas/...

jreback added the Enhancement and IO Stata labels on Jan 11, 2016
@kshedden (Author) commented:

Mislabeled as IO Stata, should be IO SAS

jreback added the IO SAS label and removed the IO Stata label on Jan 12, 2016
@jreback commented Jan 12, 2016

@kshedden right!

# Read SAS7BDAT files
#
# Based on code written by Jared Hobbs:
# https://bitbucket.org/jaredhobbs/sas7bdat
Review comment:

if there is an existing license, pls add to the LICENSE dir.

@jreback commented Jan 19, 2016

how's this going?

self._current_row_on_page_index = 0
self._current_row_in_file_index = 0

if isinstance(path_or_buf, str):
Review comment:

try to use pandas.io.common.get_filepath_or_buffer (it handles encoding, URLs, and such)
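
A rough sketch of that suggestion (the _open_sas helper is hypothetical, and the three-value return matches the pandas 0.18-era signature of get_filepath_or_buffer; later pandas versions return a different tuple):

from pandas.io.common import get_filepath_or_buffer

def _open_sas(path_or_buf):
    # resolve URLs / encoding, then open local paths in binary mode;
    # anything else is assumed to already be an open binary buffer
    filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf)
    if isinstance(filepath_or_buffer, str):
        return open(filepath_or_buffer, 'rb')
    return filepath_or_buffer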

@Winand commented Jan 25, 2016

Wow! I need this feature very much. :) Is it written completely from scratch, or does it use the https://bitbucket.org/jaredhobbs/sas7bdat source? I got a huge performance improvement with the latter by moving the most frequently called parts into a Cython module.

@kshedden (Author) commented:

@Winand thanks for the feedback. This is heavily based on Jared Hobbs' code. The decompressors were totally rewritten in Cython, partly for performance and partly because there are known issues with the existing implementations. I need to do more testing before this is ready to merge.

@Winand commented Jan 25, 2016

@kshedden I hope it'll be fast; I need to read a 1-1.5M row table.
Please test codepage 1251 support before merging :-D

@kshedden (Author) commented:

I wrote this because I need to read 500M-row files. Current timing on a pretty fast server is 11 seconds per 100,000 rows (with 28 columns, a few needing datetime conversion). It could be faster, I'm sure. I do need to test encodings.

@kshedden (Author) commented:

Getting closer, still to do:

  • Try to find encoding information in the SAS7BDAT file
  • Move _process_byte_array_with_data into the Cython file and rename it to something more appropriate

@jreback commented Jan 26, 2016

does the SAS file actually record what encoding it's in (in the file itself)? usually this is an external parameter, as you need to decode before reading :) though since it's binary I suppose it could be self-describing

rslt[name] = np.asarray(rslt[name], dtype=np.float64)
if self.convert_dates and (self.column_formats[j] == "MMDDYY"):
epoch = pd.datetime(1960, 1, 1)
rslt[name] = epoch + pd.to_timedelta(rslt[name], unit='d')
Review comment:

it is pretty inefficient to do this. much better is something like this:

l = []
for j in range(self.column_count):
    # create an array of the appropriate dtype
    arr = ....
    l.append(Series(arr))
result = pd.concat(l, keys=self.column_names, axis=1)

will have at most one copy of everything.

the current method could potentially make a fair number of copies (this is in the current implementation of pandas; eventually this will not be the case, but that's in the future :)

In [3]: pd.concat([Series([1,2,3]),Series(['foo','bar','baz'])],keys=['ints','objs'],axis=1)
Out[3]: 
   ints objs
0     1  foo
1     2  bar
2     3  baz
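
For illustration, a self-contained version of that single-concat pattern (the column names and buffers here are made up; a real reader would fill the arrays from the file):

import numpy as np
import pandas as pd

# hypothetical per-column buffers as they might come off the decoder
column_names = ['x', 'name']
buffers = [np.array([1.5, 2.5, 3.5]),
           np.array(['foo', 'bar', 'baz'], dtype=object)]

# wrap each buffer in a Series, then concatenate once at the end,
# so each column is copied at most one time
chunks = [pd.Series(arr) for arr in buffers]
result = pd.concat(chunks, keys=column_names, axis=1)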

@kshedden (Author) commented:

Yes, the encoding is shown in the SAS log and in the output of proc contents, and you can specify it when creating a dataset, so I think it must be stored explicitly in the file somewhere. But its offset position is not stated in any of the documents I have seen.
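
A minimal sketch of such detection, assuming (as in the reader that was eventually merged) a single-byte encoding code at offset 70 in the file header; the code-to-name table here is abbreviated and illustrative:

ENCODING_OFFSET = 70
ENCODING_NAMES = {20: 'utf-8', 29: 'latin1', 62: 'cp1252'}

def sniff_encoding(path):
    # read the one-byte encoding code and map it to a Python codec name
    with open(path, 'rb') as f:
        f.seek(ENCODING_OFFSET)
        code = f.read(1)[0]  # Python 3: indexing bytes yields an int
    return ENCODING_NAMES.get(code, 'unknown (code=%d)' % code)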

df.iloc[:, k] = df.iloc[:, k].astype(np.float64)
self.data.append(df)

def test1(self):
Review comment:

ideally, try to give the tests descriptive names

@Winand commented Jan 27, 2016

How are you going to read 500M rows? That would need a LOT of RAM. My current workflow: iteratively read the 1.5M rows in 200k-row chunks (18 columns), save the chunks to an on-disk store (homemade, Castra-like) as Categoricals, then read from the store into a DataFrame.
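
A sketch of that chunked workflow using the chunksize argument of pd.read_sas (the file name, chunk size, and the store() sink are placeholders):

import pandas as pd

reader = pd.read_sas('big.sas7bdat', encoding='cp1251', chunksize=200000)
for chunk in reader:
    # convert repetitive text columns to Categorical to save space
    for col in chunk.select_dtypes(include=['object']).columns:
        chunk[col] = chunk[col].astype('category')
    store(chunk)  # hypothetical on-disk sink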


Trying to make Hobbs' code faster, I found that handling codecs individually is faster than s.decode('codec_name') (actually I handle cp1251 only):

import codecs
import encodings.cp1251

if encoding_is_cp1251:
    row_elements.append(codecs.charmap_decode(s, encoding_errors, encodings.cp1251.decoding_table)[0])
else: ...

@kshedden (Author) commented Feb 3, 2016

@jreback, I don't have any more outstanding things to do here. If you have any comments let me know.

jreback changed the title from WIP: SAS7BDAT reader to ENH: SAS7BDAT reader on Feb 8, 2016
jreback added this to the 0.18.0 milestone on Feb 8, 2016
@@ -0,0 +1,51 @@
def read_sas(filepath_or_buffer, format=None, index=None, encoding='utf-8',
Review comment:

add a description at the top of this file

@jreback commented Feb 12, 2016

this will close #4052, though that issue also asks for a to_sas writer. does anything like that exist? is it even worth it?

@Winand commented Feb 12, 2016

(IMHO) As I understand it, sas7bdat is SAS's internal format; it's not widely used to share data. It would require a lot of effort to create a sas7bdat writer, and it would rarely be used.

@jreback commented Feb 12, 2016

@Winand that's what I thought. once you go pandas you don't go back 😄

@jreback commented Feb 13, 2016

@kshedden tests look good. just need some PEP8 cleanup (you can use autopep8 if you want), or fix it manually

kshedden force-pushed the sas7bdat branch 2 times, most recently from 612182d to 0abac6a, on February 13, 2016
@jreback commented Feb 13, 2016

FYI, see 6100a76

I had to rename the existing .XPT and .DTA test files to lowercase (as they weren't installing correctly) and didn't want a mix of things.

@kshedden (Author) commented:

Am I supposed to rebase this PR on 6100a76? I am getting some rename conflicts that I can't figure out how to resolve.

@jreback commented Feb 13, 2016

yes

I renamed some XPT to xpt and DTA to dta

@jreback commented Feb 15, 2016

@kshedden so we have the lint filters turned on now; you can see the errors at the bottom of the Travis output

and locally

flake8 pandas/io/sas

integers, dates, or categoricals. By default the whole file is read
and returned as a ``DataFrame``.
SAS files only contain two value types: ASCII text and floating point
values (usually 8 bytes but sometimes truncated). For xport files,
Review comment:

double backtick around xport and SAS7BDAT so they stand out a bit
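
As an aside on the "sometimes truncated" floating point values mentioned in the excerpt above: truncation drops low-order mantissa bytes, so a value can be widened by zero-padding back to 8 bytes before unpacking. A minimal sketch, assuming a little-endian file (big-endian files pad on the other side):

import struct

def decode_truncated_double(raw):
    # re-insert the dropped low-order (little-endian: leading) bytes
    pad = b'\x00' * (8 - len(raw))
    return struct.unpack('<d', pad + raw)[0]

full = struct.pack('<d', 1234.5)
print(decode_truncated_double(full[2:]))  # 1234.5 survives truncation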

@jreback commented Feb 17, 2016

@kshedden if you could update, that would be great

@kshedden (Author) commented:

@jreback can you give me a tip about these test failures?

@jreback commented Feb 19, 2016

flake8 issues:

pandas/io/sas/__init__.py:2:1: F403 'from pandas.io.sas.api import *' used; unable to detect undefined names
pandas/io/sas/api.py:1:1: F401 'read_sas' imported but unused

the other failure I will fix in a moment (xarray updated to a new version)
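
For reference, the usual fix for that pair of warnings is to make the re-export explicit (module paths taken from the error messages above; the exact import line is an assumption):

# pandas/io/sas/__init__.py -- import explicitly instead of 'import *'
from pandas.io.sas.api import read_sas  # noqa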

@jreback commented Feb 19, 2016

@kshedden ok, rebase away and you should be good

kshedden force-pushed the sas7bdat branch 2 times, most recently from 98e75f1 to 7657009, on February 19, 2016
Working version except for compression

reorganized directory structure

Added license file for Jared Hobbs code

RLE decompression

use ndarray instead of bytes

RDC decompression

Fix byte order swapping

fix rebase errors in test_xport

Use filepath_or_buffer io function

Handle alignment correction

Revamped testing

Add test with unicode strings

Add minimal encoding detection

Refactor row-processing

Add missing test file

Unclobber test files

Try again to revert accidental changes to test data files

Minor changes in response to code review

Add SAS benchmarks to ASV

Stash changes before rebase

refactor following code review

Updated io and whatsnew

Updates following code review

Remove local test modifications

Minor changes following code review

Remove unwanted test data file

Mostly formatting changes following code review

Remove two unneeded files

Add __init__.py
@kshedden (Author) commented:

@jreback I think it's ready

jreback closed this in 23810e5 on Feb 20, 2016
@jreback commented Feb 20, 2016

thanks @kshedden awesome enhancement!

pls check out the built docs and such (they will probably take a few hours to build) and issue a follow-up PR if needed.

kshedden mentioned this pull request on Feb 21, 2016
jorisvandenbossche pushed a commit that referenced this pull request Feb 22, 2016
Minor doc fixes following merge of PR #12015.

Author: Kerby Shedden <kshedden@umich.edu>

Closes #12407 from kshedden/sas7bdat_docfix and squashes the following commits:

8ba57ce [Kerby Shedden] doc fix for sas7bdat
@Winand commented Apr 21, 2016

Have just tested this on a big file. It still needs some bug fixes and performance improvements. Compared to a slightly modified version of Jared Hobbs' sas7bdat module:

df = sas7bdat.SAS7BDAT(path, encoding='cp1251')
df = df.to_data_frame()  # takes 47.8s for [1487559 rows x 17 columns]
...
df = pd.read_sas(path, encoding='cp1251')  # takes 116.6s for [1487559 rows x 17 columns]
