How to copy a CSR matrix into a sparsevec column? #127

Closed
@TopCoder2K

I have a sparse vector, the result of applying sklearn's TfidfVectorizer to a single document:

<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 4 stored elements and shape (1, 157541)>
  Coords        Values
  (0, 5051)     0.35521903059198523
  (0, 14956)    0.5566306658037382
  (0, 45152)    0.7328483894186835
  (0, 60738)    0.1640578566196061

which I want to copy into a table with a sparsevec column. As far as I understand from the documentation, the correct way to do this is the following:

        with cur.copy(
            "COPY my_table FROM STDIN WITH (FORMAT BINARY)"
        ) as copy:
            copy.set_types(["sparsevec"])
            copy.write_row((SparseVector(the_sparse_vector),))

but this produces an error:

psycopg.errors.DataException: sparsevec indices must not contain duplicates

I investigated a bit and found this line, which uses value.coords[0] (the row indices) rather than value.coords[1] (the column indices), even for two-dimensional input. Is this a bug? What should I do?

Additional information about the example:

  1. The code
print(the_sparse_vector)
the_sparse_vector = the_sparse_vector.tocoo()
print(the_sparse_vector.ndim, the_sparse_vector.shape)
print(the_sparse_vector.coords)
print(the_sparse_vector.data)
print(SparseVector(the_sparse_vector))

outputs:

<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 4 stored elements and shape (1, 157541)>
  Coords        Values
  (0, 5051)     0.35521903059198523
  (0, 14956)    0.5566306658037382
  (0, 45152)    0.7328483894186835
  (0, 60738)    0.1640578566196061
2 (1, 157541)
(array([0, 0, 0, 0], dtype=int32), array([ 5051, 14956, 45152, 60738], dtype=int32))
[0.35521903 0.55663067 0.73284839 0.16405786]
SparseVector({0: 0.1640578566196061}, 157541)
  2. I have the following package versions installed:
psycopg           3.2.6
psycopg-binary    3.2.6
pgvector          0.4.0
scipy             1.15.2
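
The output above shows the row indices (all 0) being used in place of the column indices, which matches the duplicate-indices error. One possible workaround, offered as a sketch and not as the library's intended fix, is to build the index-to-value mapping by hand and pass it together with the dimension, in the same `SparseVector({index: value}, dim)` form that appears in the printed repr. The helper name `row_to_sparsevec_args` is hypothetical, and the sketch assumes a matrix with exactly one row and no duplicate stored entries:

```python
def row_to_sparsevec_args(matrix):
    """Return (index -> value dict, dimension) for a one-row sparse matrix.

    Hypothetical helper: `matrix` is assumed to behave like a SciPy sparse
    matrix, i.e. it offers .tocoo() and the result has .shape, .col and .data.
    """
    coo = matrix.tocoo()
    n_rows, n_cols = coo.shape
    if n_rows != 1:
        raise ValueError(f"expected a single row, got shape {coo.shape}")

    # coo.col holds the column index of every stored element. For a one-row
    # matrix that is the position inside the vector; the row indices are all 0.
    # Assumption: no duplicate entries, otherwise later values overwrite
    # earlier ones instead of being summed.
    elements = {int(i): float(v) for i, v in zip(coo.col, coo.data)}
    return elements, int(n_cols)
```

With the matrix from this issue, `elements, dim = row_to_sparsevec_args(the_sparse_vector)` followed by `copy.write_row((SparseVector(elements, dim),))` should write the four column indices instead of four zeros.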
