
ENH: SAS7BDAT reader #12015

Closed · wants to merge 1 commit (from kshedden's sas7bdat branch)

Conversation

@kshedden (Author) opened this pull request:

This needs more testing, but basically seems to work for reading uncompressed SAS7BDAT files. I will add support for compression in a few days.

closes #4052

@jreback commented Jan 11, 2016

nice! I would put this in a subdir

io/sas/...

jreback added the Enhancement and IO Stata labels on Jan 11, 2016
@kshedden (Author) commented:

Mislabeled as IO Stata, should be IO SAS

jreback added the IO SAS label and removed the IO Stata label on Jan 12, 2016
@jreback commented Jan 12, 2016

@kshedden right!

# Read SAS7BDAT files
#
# Based on code written by Jared Hobbs:
# https://bitbucket.org/jaredhobbs/sas7bdat
Review comment:

if there is an existing license, pls add to the LICENSE dir.

@jreback commented Jan 19, 2016

how's this going?

self._current_row_on_page_index = 0
self._current_row_in_file_index = 0

if isinstance(path_or_buf, str):
Review comment:

try to use pandas.io.common.get_filepath_or_buffer (it handles encoding, URLs, and such)
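
A rough sketch of that suggestion (the _open_sas helper is hypothetical, and the three-value return matches the pandas 0.18-era signature of get_filepath_or_buffer; later pandas versions return a different tuple):

from pandas.io.common import get_filepath_or_buffer

def _open_sas(path_or_buf):
    # resolve URLs / encoding, then open local paths in binary mode;
    # anything else is assumed to already be an open binary buffer
    filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf)
    if isinstance(filepath_or_buffer, str):
        return open(filepath_or_buffer, 'rb')
    return filepath_or_buffer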

@Winand commented Jan 25, 2016

Wow! I need this feature very much. :) Is it written completely from scratch, or does it use the https://bitbucket.org/jaredhobbs/sas7bdat source? I got a huge performance improvement with the latter by moving the most frequently called parts into a Cython module.

@kshedden (Author) commented:

@Winand thanks for the feedback. This is heavily based on Jared Hobbs' code. The decompressors were totally rewritten in Cython, partly for performance and partly because there are known issues with the existing implementations. I need to do more testing before this is ready to merge.

@Winand commented Jan 25, 2016

@kshedden I hope it'll be fast; I need to read a 1-1.5M row table.
Please test codepage 1251 support before merging :-D

@kshedden (Author) commented:

I wrote this because I need to read 500M-row files. Current timing on a pretty fast server is 11 seconds per 100,000 rows (with 28 columns, a few needing datetime conversion). It could be faster, I'm sure. I do need to test encodings.

@kshedden (Author) commented:

Getting closer, still to do:

  • Try to find encoding information in the SAS7BDAT file
  • Move _process_byte_array_with_data into the Cython file and rename it to something more appropriate

@jreback commented Jan 26, 2016

does the SAS file actually record what encoding it's in (in the file itself)? usually this is an external parameter, as you need to decode before reading :) though since it's binary I suppose it could be self-describing

rslt[name] = np.asarray(rslt[name], dtype=np.float64)
if self.convert_dates and (self.column_formats[j] == "MMDDYY"):
epoch = pd.datetime(1960, 1, 1)
rslt[name] = epoch + pd.to_timedelta(rslt[name], unit='d')
Review comment:

it is pretty inefficient to do this. much better is something like this:

l = []
for j in range(self.column_count):
    # create an array of the appropriate dtype
    arr = ....
    l.append(Series(arr))
result = pd.concat(l, keys=self.column_names, axis=1)

will have at most one copy of everything.

the current method could potentially make a fair number of copies (this is in the current implementation of pandas; eventually this will not be the case, but that's in the future :)

In [3]: pd.concat([Series([1,2,3]),Series(['foo','bar','baz'])],keys=['ints','objs'],axis=1)
Out[3]: 
   ints objs
0     1  foo
1     2  bar
2     3  baz
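
For illustration, a self-contained version of that single-concat pattern (the column names and buffers here are made up; a real reader would fill the arrays from the file):

import numpy as np
import pandas as pd

# hypothetical per-column buffers as they might come off the decoder
column_names = ['x', 'name']
buffers = [np.array([1.5, 2.5, 3.5]),
           np.array(['foo', 'bar', 'baz'], dtype=object)]

# wrap each buffer in a Series, then concatenate once at the end,
# so each column is copied at most one time
chunks = [pd.Series(arr) for arr in buffers]
result = pd.concat(chunks, keys=column_names, axis=1)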

@kshedden (Author) commented:

Yes, the encoding is shown in the SAS log and in the output of proc contents, and you can specify it when creating a dataset, so I think it must be stored explicitly in the file somewhere. But its offset position is not stated in any of the documents I have seen.
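
A minimal sketch of such detection, assuming (as in the reader that was eventually merged) a single-byte encoding code at offset 70 in the file header; the code-to-name table here is abbreviated and illustrative:

ENCODING_OFFSET = 70
ENCODING_NAMES = {20: 'utf-8', 29: 'latin1', 62: 'cp1252'}

def sniff_encoding(path):
    # read the one-byte encoding code and map it to a Python codec name
    with open(path, 'rb') as f:
        f.seek(ENCODING_OFFSET)
        code = f.read(1)[0]  # Python 3: indexing bytes yields an int
    return ENCODING_NAMES.get(code, 'unknown (code=%d)' % code)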

df.iloc[:, k] = df.iloc[:, k].astype(np.float64)
self.data.append(df)

def test1(self):
Review comment:

ideally, try to give the tests descriptive names

@Winand commented Jan 27, 2016

How are you going to read 500M rows? That would need a LOT of RAM. My current workflow: iteratively read the 1.5M rows in 200k-row chunks (18 columns), save the chunks to an on-disk store (homemade, Castra-like) as Categoricals, then read from the store into a DataFrame.
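
A sketch of that chunked workflow using the chunksize argument of pd.read_sas (the file name, chunk size, and the store() sink are placeholders):

import pandas as pd

reader = pd.read_sas('big.sas7bdat', encoding='cp1251', chunksize=200000)
for chunk in reader:
    # convert repetitive text columns to Categorical to save space
    for col in chunk.select_dtypes(include=['object']).columns:
        chunk[col] = chunk[col].astype('category')
    store(chunk)  # hypothetical on-disk sink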


Trying to make Hobbs' code faster, I found that handling codecs individually is faster than s.decode('codec_name') (actually I handle cp1251 only):

import codecs
import encodings.cp1251

if encoding_is_cp1251:
    row_elements.append(codecs.charmap_decode(s, encoding_errors, encodings.cp1251.decoding_table)[0])
else: ...

@kshedden (Author) commented Feb 3, 2016

@jreback, I don't have any more outstanding things to do here. If you have any comments let me know.

jreback changed the title from WIP: SAS7BDAT reader to ENH: SAS7BDAT reader on Feb 8, 2016
jreback added this to the 0.18.0 milestone on Feb 8, 2016
@@ -0,0 +1,51 @@
def read_sas(filepath_or_buffer, format=None, index=None, encoding='utf-8',
Review comment:

add a description at the top of this file

@jreback commented Feb 12, 2016

this will close #4052, though that issue also asks for a to_sas writer. does anything like that exist? is it even worth it?

@Winand commented Feb 12, 2016

(IMHO) As I understand it, sas7bdat is SAS's internal format; it's not widely used to share data. It would require a lot of effort to create a sas7bdat writer, and it would rarely be used.

@jreback commented Feb 12, 2016

@Winand that's what I thought. once you go pandas you don't go back 😄

@jreback commented Feb 13, 2016

@kshedden tests look good. just need some PEP8 cleanup (you can use autopep8 if you want), or fix it manually

kshedden force-pushed the sas7bdat branch 2 times, most recently from 612182d to 0abac6a, on February 13, 2016
@jreback commented Feb 13, 2016

FYI, see 6100a76

I had to rename the existing .XPT and .DTA test files to lowercase (as they weren't installing correctly) and didn't want a mix of things.

@kshedden (Author) commented:

Am I supposed to rebase this PR on 6100a76? I am getting some rename conflicts that I can't figure out how to resolve.

@jreback commented Feb 13, 2016

yes

I renamed some XPT to xpt and DTA to dta

@jreback commented Feb 15, 2016

@kshedden so we have the lint filters turned on now; you can see the errors at the bottom of the Travis output

and locally

flake8 pandas/io/sas

integers, dates, or categoricals. By default the whole file is read
and returned as a ``DataFrame``.
SAS files only contain two value types: ASCII text and floating point
values (usually 8 bytes but sometimes truncated). For xport files,
Review comment:

double backtick around xport and SAS7BDAT so they stand out a bit
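
As an aside on the "sometimes truncated" floating point values mentioned in the excerpt above: truncation drops low-order mantissa bytes, so a value can be widened by zero-padding back to 8 bytes before unpacking. A minimal sketch, assuming a little-endian file (big-endian files pad on the other side):

import struct

def decode_truncated_double(raw):
    # re-insert the dropped low-order (little-endian: leading) bytes
    pad = b'\x00' * (8 - len(raw))
    return struct.unpack('<d', pad + raw)[0]

full = struct.pack('<d', 1234.5)
print(decode_truncated_double(full[2:]))  # 1234.5 survives truncation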

@jreback commented Feb 17, 2016

@kshedden if you could update, that would be great

@kshedden (Author) commented:

@jreback can you give me a tip about these test failures?

@jreback commented Feb 19, 2016

flake8 issues:

pandas/io/sas/__init__.py:2:1: F403 'from pandas.io.sas.api import *' used; unable to detect undefined names
pandas/io/sas/api.py:1:1: F401 'read_sas' imported but unused

the other failure I will fix in a moment (xarray updated to a new version)
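
For reference, the usual fix for that pair of warnings is to make the re-export explicit (module paths taken from the error messages above; the exact import line is an assumption):

# pandas/io/sas/__init__.py -- import explicitly instead of 'import *'
from pandas.io.sas.api import read_sas  # noqa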

@jreback commented Feb 19, 2016

@kshedden ok, rebase away and you should be good

kshedden force-pushed the sas7bdat branch 2 times, most recently from 98e75f1 to 7657009, on February 19, 2016
Working version except for compression

reorganized directory structure

Added license file for Jared Hobbs code

RLE decompression

use ndarray instead of bytes

RDC decompression

Fix byte order swapping

fix rebase errors in test_xport

Use filepath_or_buffer io function

Handle alignment correction

Revamped testing

Add test with unicode strings

Add minimal encoding detection

Refactor row-processing

Add missing test file

Unclobber test files

Try again to revert accidental changes to test data files

Minor changes in response to code review

Add SAS benchmarks to ASV

Stash changes before rebase

refactor following code review

Updated io and whatsnew

Updates following code review

Remove local test modifications

Minor changes following code review

Remove unwanted test data file

Mostly formatting changes following code review

Remove two unneeded files

Add __init__.py
@kshedden (Author) commented:

@jreback I think it's ready

jreback closed this in 23810e5 on Feb 20, 2016
@jreback commented Feb 20, 2016

thanks @kshedden awesome enhancement!

pls check out the built docs and such (they will probably take a few hours to build) and issue a follow-up PR if needed.

kshedden mentioned this pull request on Feb 21, 2016
jorisvandenbossche pushed a commit that referenced this pull request Feb 22, 2016
Minor doc fixes following merge of PR #12015.

Author: Kerby Shedden <kshedden@umich.edu>

Closes #12407 from kshedden/sas7bdat_docfix and squashes the following commits:

8ba57ce [Kerby Shedden] doc fix for sas7bdat
@Winand commented Apr 21, 2016

Have just tested this on a big file. It still needs some bug fixes and performance improvements. Compared to a slightly modified version of Jared Hobbs' sas7bdat module:

df = sas7bdat.SAS7BDAT(path, encoding='cp1251')
df = df.to_data_frame()  # takes 47.8s for [1487559 rows x 17 columns]
...
df = pd.read_sas(path, encoding='cp1251')  # takes 116.6s for [1487559 rows x 17 columns]
