ENH: SAS7BDAT reader #12015
Conversation
nice! I would put this in a subdir |
Mislabeled as IO Stata, should be IO SAS |
@kshedden right! |
# Read SAS7BDAT files
#
# Based on code written by Jared Hobbs:
# https://bitbucket.org/jaredhobbs/sas7bdat
if there is an existing license, pls add to the LICENSE dir.
how's this going? |
self._current_row_on_page_index = 0
self._current_row_in_file_index = 0

if isinstance(path_or_buf, str):
try to use pandas.io.common.get_filepath_or_buffer (it handles encoding / URLs and such)
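A minimal sketch of that change, not the actual patch: the helper name _open_sas_source is invented for illustration, and since the helper's return tuple has varied across pandas versions, only its first element (the resolved path or buffer) is relied on here.

from pandas.io.common import get_filepath_or_buffer

def _open_sas_source(path_or_buf, encoding='utf-8'):
    # Resolve local paths, URLs, and open buffers to something readable.
    # Only the first element of the returned tuple is used, since the
    # tuple's length has differed between pandas versions.
    resolved = get_filepath_or_buffer(path_or_buf, encoding=encoding)
    path_or_buf = resolved[0]
    if isinstance(path_or_buf, str):
        # a plain filesystem path comes back unchanged, so open it here
        path_or_buf = open(path_or_buf, 'rb')
    return path_or_buf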
Wow! I need this feature very much. Is it written completely from scratch or based on the https://bitbucket.org/jaredhobbs/sas7bdat source? I got a huge performance improvement with the latter by moving the most frequently called parts into a Cython module |
@Winand thanks for the feedback. This is heavily based on Jared Hobbs' code. The decompressors were totally rewritten in cython partly for performance and partly because there are known issues with the existing implementations. I need to do more testing before this is ready to merge. |
@kshedden I hope it'll be fast, I need to read tables of 1-1.5M rows. |
I wrote this because I need to read 500M row files. Current timing on a pretty fast server is 11 seconds per 100,000 rows (with 28 columns, a few needing datetime conversion). Could be faster I'm sure. I do need to test encodings. |
Getting closer, still to do: |
does the SAS file actually have what encoding it's in (in the file itself)? usually this is an external parameter as you need to decode before reading :) though it's in binary so I suppose it could be (e.g. self-describing) |
rslt[name] = np.asarray(rslt[name], dtype=np.float64)
if self.convert_dates and (self.column_formats[j] == "MMDDYY"):
    epoch = pd.datetime(1960, 1, 1)
    rslt[name] = epoch + pd.to_timedelta(rslt[name], unit='d')
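For context, SAS stores dates as floating point day counts from the SAS epoch of 1960-01-01, which is what the MMDDYY branch above converts. A standalone sketch with made-up values:

import numpy as np
import pandas as pd

# SAS date values: days since 1960-01-01, stored as floats.
sas_days = np.array([0.0, 366.0, 20454.0])   # example values only
epoch = pd.Timestamp('1960-01-01')
dates = epoch + pd.to_timedelta(sas_days, unit='d')
# -> DatetimeIndex(['1960-01-01', '1961-01-01', '2016-01-01'], dtype='datetime64[ns]')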
it is pretty inefficient to do this. much better is something like this.
l = []
for j in range(self.column_count):
    # create an array of the appropriate dtype
    arr = ....
    l.append(Series(arr))
result = pd.concat(l, keys=self.column_names, axis=1)
will have at most 1 copy of everything. The current method could potentially have a fair number of copies (this is in the current impl of pandas; eventually this will not be the case, but that's in the future :)
In [3]: pd.concat([Series([1,2,3]), Series(['foo','bar','baz'])], keys=['ints','objs'], axis=1)
Out[3]:
   ints objs
0     1  foo
1     2  bar
2     3  baz
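A self-contained version of the suggested pattern, with invented column names and data just to show the shape of it:

import numpy as np
import pandas as pd
from pandas import Series

# Build each column as an array of its final dtype, then concatenate
# once, so at most one copy of the data is made.
column_names = ['x', 'y', 'label']          # hypothetical columns
raw = {'x': [1.0, 2.0, 3.0],
       'y': [0.5, 1.5, 2.5],
       'label': ['a', 'b', 'c']}

pieces = []
for name in column_names:
    dtype = object if name == 'label' else np.float64
    pieces.append(Series(np.asarray(raw[name], dtype=dtype)))

result = pd.concat(pieces, keys=column_names, axis=1)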
Yes, the encoding is in the SAS log and in the output of proc contents. And you can specify it when creating a dataset. So I think it must be explicitly in there somewhere. But its offset position is not stated in any of the documents I have seen. |
df.iloc[:, k] = df.iloc[:, k].astype(np.float64)
self.data.append(df)

def test1(self):
ideally try to have nice names for the tests
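A hypothetical sketch of what that could look like; the fixture paths and expected data are made up, only the naming pattern is the point:

import pandas as pd
import pandas.util.testing as tm

def test_read_uncompressed_sas7bdat():
    # the name says what is being exercised, unlike test1/test2
    df = pd.read_sas('data/test_uncompressed.sas7bdat')
    expected = pd.read_csv('data/test_uncompressed.csv')
    tm.assert_frame_equal(df, expected)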
How are you going to read 500M rows? That would need a LOT of RAM. My current workflow: iteratively read 1.5M rows in 200k chunks (18 columns), save the chunks to an on-disk storage (homemade, Castra-like) with Categoricals, then read from storage into a DataFrame. Trying to make Hobbs' code faster I've found out that handling codecs individually is better than |
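For what it's worth, the reader in this PR exposes chunked iteration through read_sas (assuming the chunksize option lands as written); a sketch of that kind of workflow, with placeholder paths and a placeholder on-disk sink:

import pandas as pd

# Stream a large SAS7BDAT file in 200k-row chunks instead of loading it
# all at once; each chunk arrives as an ordinary DataFrame.
for i, chunk in enumerate(pd.read_sas('big_file.sas7bdat', chunksize=200000)):
    # shrink repetitive string columns before persisting
    for col in chunk.select_dtypes(include=['object']).columns:
        chunk[col] = chunk[col].astype('category')
    chunk.to_pickle('chunk_{:03d}.pkl'.format(i))   # placeholder sink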
@jreback, I don't have any more outstanding things to do here. If you have any comments let me know. |
@@ -0,0 +1,51 @@
def read_sas(filepath_or_buffer, format=None, index=None, encoding='utf-8',
add a description at the top of this file
this will close #4052, though that has a |
(IMHO) As I understand it, sas7bdat is SAS's internal format; it's not widely used for sharing data. It would require a lot of effort to create a sas7bdat writer, and it would be rarely used |
@Winand that's what I thought. once you go pandas you don't go back 😄 |
@kshedden tests look good. just need some pep cleanup (you can use |
(force-pushed from 612182d to 0abac6a)
FYI, see 6100a76. I had to rename the existing .XPT and .DTA files to lowercase (as they weren't installing correctly) and didn't want a mix of things. |
Am I supposed to rebase this PR on 6100a76? I am getting some rename conflicts that I can't figure out how to resolve. |
yes I renamed some XPT to xpt and DTA to dta |
@kshedden so we have the lint filters turned on now - you can see the errors at the bottom of the travis output and locally |
integers, dates, or categoricals. By default the whole file is read
and returned as a ``DataFrame``.
SAS files only contain two value types: ASCII text and floating point
values (usually 8 bytes but sometimes truncated). For xport files,
double backtick around xport and SAS7BDAT so they stand out a bit
@kshedden if you could update would be great |
@jreback can you give me a tip about these test failures? |
pep issues
the other failure I will fix in a moment (xarray updated to a new version) |
@kshedden ok, rebase away and you should be good |
(force-pushed from 98e75f1 to 7657009)
Working version except for compression
reorganized directory structure
Added license file for Jared Hobbs code
RLE decompression
use ndarray instead of bytes
RDC decompression
Fix byte order swapping
fix rebase errors in test_xport
Use filepath_or_buffer io function
Handle alignment correction
Revamped testing
Add test with unicode strings
Add minimal encoding detection
Refactor row-processing
Add missing test file
Unclobber test files
Try again to revert accidental changes to test data files
Minor changes in response to code review
Add SAS benchmarks to ASV
Stash changes before rebase
refactor following code review
Updated io and whatsnew
Updates following code review
Remove local test modifications
Minor changes following code review
Remove unwanted test data file
Mostly formatting changes following code review
Remove two unneeded files
Add __init__.py
@jreback I think it's ready |
thanks @kshedden awesome enhancement! pls check out the built docs and such (will prob take a few hours) and issue a followup PR if needed. |
Have just tested on a big file. This still needs some bug fixes and performance improvements. Comparing to a slightly modified version of Jared Hobbs' sas7bdat module: |
This needs more testing, but basically seems to work for reading uncompressed SAS7BDAT files. I will add support for compression in a few days.
closes #4052
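A short usage sketch of the reader as described above (the file name is a placeholder; compression support was still pending at this point):

import pandas as pd

# Read an uncompressed SAS7BDAT file into a DataFrame.
df = pd.read_sas('example.sas7bdat', encoding='utf-8')

# An open binary buffer also works if the format is given explicitly.
with open('example.sas7bdat', 'rb') as f:
    df2 = pd.read_sas(f, format='sas7bdat', encoding='utf-8')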