Unicode strings unexpectedly transformed to byte strings upon open_dataset #1638

Closed
@olgabot

When I first create the dataset, all the metadata is stored as unicode strings (yay!):

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * gene                          (gene) object '0610005C13Rik' ...
    Uniquely mapped reads number  (cell) int64 1017682 634557 941828 1392029 ...
    Number of input reads         (cell) int64 1229254 730274 1075370 ...
    EXP_ID                        (cell) <U29 '170925_A00111_0066_AH3TKNDMXX' ...
    TAXON                         (cell) <U3 'mus' 'mus' 'mus' 'mus' 'mus' ...
    WELL_MAPPING                  (cell) <U9 'B000126' 'B000126' 'B000126' ...
    Lysis Plate Batch             (cell) <U32 '20' '20' '20' '20' '20' '20' ...
    dNTP.batch                    (cell) <U38 '457912' '457912' '457912' ...
    oligodT.order.no              (cell) <U32 '6/23/17 12757296' ...
    plate.type                    (cell) <U32 'Biorad HSP3901' ...
    preparation.site              (cell) <U32 'Biohub' 'Biohub' 'Biohub' ...
    date.prepared                 (cell) <U32 '07-06-17' '07-06-17' ...
    date.sorted                   (cell) <U6 '170707' '170707' '170707' ...
    tissue                        (cell) <U13 'Skin' 'Skin' 'Skin' 'Skin' ...
    subtissue                     (cell) <U32 'nan' 'nan' 'nan' 'nan' 'nan' ...
    mouse.id                      (cell) <U13 '3_39_F' '3_39_F' '3_39_F' ...
    FACS.selection                (cell) <U52 'Multiple' 'Multiple' ...
    nozzle.size                   (cell) <U32 '100' '100' '100' '100' '100' ...
    FACS.instument                (cell) <U32 'Sony SIM1' 'Sony SIM1' ...
    Experiment ID                 (cell) <U32 'exp22' 'exp22' 'exp22' ...
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    Plate                         (cell) <U32 '1' '1' '1' '1' '1' '1' '1' ...
    Location                      (cell) <U32 'MACA20_3' 'MACA20_3' ...
    Comments                      (cell) <U32 'nan' 'nan' 'nan' 'nan' 'nan' ...
    mouse.age                     (cell) <U1 '3' '3' '3' '3' '3' '3' '3' '3' ...
    mouse.number                  (cell) <U32 '39' '39' '39' '39' '39' '39' ...
    mouse.sex                     (cell) <U1 'F' 'F' 'F' 'F' 'F' 'F' 'F' 'F' ...
  * cell                          (cell) object 'A17-B000126-3_39_F-1-1' ...
Data variables:
    counts                        (cell, gene) int64 0 0 0 0 442 0 0 0 0 0 0 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...

But when I save it with to_netcdf using the default arguments and then reopen the same file with xr.open_dataset, also using default arguments, all of the strings come back as byte strings:

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * cell                          (cell) |S24 b'A17-B000126-3_39_F-1-1' ...
  * gene                          (gene) |S22 b'0610005C13Rik' ...
Data variables:
    counts                        (cell, gene) int32 0 0 0 0 442 0 0 0 0 0 0 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...
    FACS.selection                (cell) |S52 b'Multiple' b'Multiple' ...
    dNTP.batch                    (cell) |S38 b'457912' b'457912' b'457912' ...
    EXP_ID                        (cell) |S29 b'170925_A00111_0066_AH3TKNDMXX' ...
    subtissue                     (cell) |S19 b'nan' b'nan' b'nan' b'nan' ...
    oligodT.order.no              (cell) |S17 b'6/23/17 12757296' ...
    plate.type                    (cell) |S14 b'Biorad HSP3901' ...
    tissue                        (cell) |S13 b'Skin' b'Skin' b'Skin' ...
    mouse.id                      (cell) |S13 b'3_39_F' b'3_39_F' b'3_39_F' ...
    FACS.instument                (cell) |S13 b'Sony SIM1' b'Sony SIM1' ...
    Comments                      (cell) |S11 b'nan' b'nan' b'nan' b'nan' ...
    WELL_MAPPING                  (cell) |S9 b'B000126' b'B000126' ...
    date.prepared                 (cell) |S9 b'07-06-17' b'07-06-17' ...
    Location                      (cell) |S9 b'MACA20_3' b'MACA20_3' ...
    preparation.site              (cell) |S8 b'Biohub' b'Biohub' b'Biohub' ...
    date.sorted                   (cell) |S6 b'170707' b'170707' b'170707' ...
    Experiment ID                 (cell) |S6 b'exp22' b'exp22' b'exp22' ...
    TAXON                         (cell) |S3 b'mus' b'mus' b'mus' b'mus' ...
    Lysis Plate Batch             (cell) |S3 b'20' b'20' b'20' b'20' b'20' ...
    nozzle.size                   (cell) |S3 b'100' b'100' b'100' b'100' ...
    Plate                         (cell) |S3 b'1' b'1' b'1' b'1' b'1' b'1' ...
    mouse.number                  (cell) |S3 b'39' b'39' b'39' b'39' b'39' ...
    Uniquely mapped reads number  (cell) int32 1017682 634557 941828 1392029 ...
    Number of input reads         (cell) int32 1229254 730274 1075370 ...
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    mouse.age                     (cell) |S1 b'3' b'3' b'3' b'3' b'3' b'3' ...
    mouse.sex                     (cell) |S1 b'F' b'F' b'F' b'F' b'F' b'F' ...

As a result, selections I expect to work, such as ds.sel(gene="Ins1"), fail unless I pass a byte string instead: ds.sel(gene=b"Ins1") works just fine.
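For reference, here is a minimal sketch of the mismatch and a possible stopgap, using only NumPy rather than the real dataset. The helper name decode_bytes_array and the two-element gene array are made up for illustration, and I'm assuming the strings are UTF-8 encoded:

```python
import numpy as np


def decode_bytes_array(values, encoding="utf-8"):
    """Return a unicode (<U) copy of a byte-string (|S) array.

    Arrays of any other dtype are returned unchanged.
    Hypothetical helper; the UTF-8 default is an assumption.
    """
    values = np.asarray(values)
    if values.dtype.kind == "S":
        return np.char.decode(values, encoding)
    return values


# Stand-in for the reloaded `gene` coordinate (dtype |S13 here).
genes_bytes = np.array([b"0610005C13Rik", b"Ins1"])
genes_text = decode_bytes_array(genes_bytes)

# A str label never equals a bytes label, which is why the lookup misses.
found_in_bytes = "Ins1" in genes_bytes.tolist()
found_in_text = "Ins1" in genes_text.tolist()

print(genes_bytes.dtype.kind, genes_text.dtype.kind)
print(found_in_bytes, found_in_text)
```

On the actual dataset I would presumably have to apply something like this to the values of every variable whose dtype kind is "S" and assign the result back, which I have not tried, and it only hides the symptom rather than explaining it.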

Do you know why this may be happening?
