VERY slow deserialization of large objects #413

Open · @bloodcarter

Description

I'm archiving this data:

	std::vector<std::shared_ptr<Lemma>> lemmas;
	std::map<std::string, std::vector<std::shared_ptr<Form>>> forms;

The lemmas vector and the forms map each contain ~100,000 elements. The problem is that deserializing from the portable binary archive takes 30 seconds on my Core i5!

		std::ifstream is("dict.cereal", std::ios::binary);
		cereal::PortableBinaryInputArchive iarchive(is); // Create an input archive
		iarchive(lemmas, forms);

Is that normal or what?

Activity

AzothAmmo (Contributor) commented on Jun 15, 2017

What is the structure of Lemma and Form? If you can't post actual code, can you just describe what they are serializing (and also the sizeof)? Are you using polymorphism?

I'll try and see if I can reproduce this. Our binary serialization should be very fast.

temehi commented on Sep 7, 2017

I am also experiencing a similar problem.
I have the following data to serialize:

    std::unordered_map<uint64_t, std::bitset<60>> my_map;

my_map contains about 8 billion elements, and the binary file saved is around 33 GB on disk. When I deserialize it using

    std::ifstream istrm("map.cereal", std::ios::binary);
    cereal::BinaryInputArchive iarchive(istrm);
    iarchive(my_map);

It takes about 2500 seconds. Isn't that a bit slow?

erichkeane (Contributor) commented on Sep 7, 2017

I would say that depends. Loading that much data into memory is going to be time consuming either way, and having to send any of it to swap is going to be quite time consuming.

Additionally, with that much data, the unordered_map is going to be rehashing near-constantly as it grows. With that much data indexed by a uint64_t, you are likely better off choosing a different data structure (depending on the distribution of your keys).

temehi commented on Sep 7, 2017

Thanks for your reply

> ... having to send that to swap is going to be quite time consuming.

No need to send anything to swap; for my particular problem, having enough memory is not an issue.

One way to avoid the rehashing is to call reserve(size_type count) on the unordered_map object before loading, as sketched below.
If I do that, the loading time goes down to ~1000 seconds.
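
For concreteness, a minimal sketch of that workaround, assuming the ~8 billion element count from the comment above (the count only needs to be an estimate; reserve sizes the bucket table up front so insertions during loading do not trigger rehashing):

    #include <bitset>
    #include <cstdint>
    #include <fstream>
    #include <unordered_map>
    #include <cereal/archives/binary.hpp>
    #include <cereal/types/bitset.hpp>
    #include <cereal/types/unordered_map.hpp>

    int main()
    {
        std::unordered_map<uint64_t, std::bitset<60>> my_map;

        // Pre-allocate buckets for the expected element count so the
        // table is not rehashed repeatedly while entries are inserted.
        my_map.reserve(8000000000ULL);

        std::ifstream istrm("map.cereal", std::ios::binary);
        cereal::BinaryInputArchive iarchive(istrm);
        iarchive(my_map);
    }

Note that cereal clears the map before filling it; on common standard library implementations clear() keeps the bucket table, which is why reserving beforehand helps.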

erichkeane (Contributor) commented on Sep 7, 2017

Well, ifstream seems to do an additional copy as part of reading as well, so you're copying the data at least twice. Perhaps consider using something like boost::iostreams::mapped_file. That'll probably save you another few hundred seconds.
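
A minimal sketch of that suggestion, assuming the map.cereal file from the earlier comments; boost::iostreams::stream over a mapped_file_source models std::istream, so a cereal archive can read from it directly:

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/iostreams/stream.hpp>
    #include <cereal/archives/binary.hpp>
    #include <cereal/types/bitset.hpp>
    #include <cereal/types/unordered_map.hpp>

    int main()
    {
        std::unordered_map<uint64_t, std::bitset<60>> my_map;

        // Map the archive into memory; the OS pages it in on demand,
        // avoiding the extra buffer copy that ifstream performs.
        boost::iostreams::mapped_file_source file("map.cereal");
        boost::iostreams::stream<boost::iostreams::mapped_file_source> istrm(file);

        cereal::BinaryInputArchive iarchive(istrm);
        iarchive(my_map);
    }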

Additionally, are you compiling with optimizations on? The cereal code is pretty template heavy, so it benefits greatly from higher optimization levels, particularly with flags like -march=native (if that is acceptable).

AzothAmmo (Contributor) commented on Sep 7, 2017

We can definitely add a call to reserve for unordered_map loads.
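
For illustration, a sketch of what such a load overload could look like, using cereal's size-tag and map-item helpers; this is not the shipped implementation, just the shape of the change:

    template <class Archive, class K, class V, class H, class E, class A>
    void load(Archive & ar, std::unordered_map<K, V, H, E, A> & map)
    {
        cereal::size_type size;
        ar(cereal::make_size_tag(size));  // number of stored key/value pairs

        map.clear();
        // Size the bucket table once, instead of rehashing as the map grows.
        map.reserve(static_cast<std::size_t>(size));

        for (cereal::size_type i = 0; i < size; ++i)
        {
            K key;
            V value;
            ar(cereal::make_map_item(key, value));
            map.emplace(std::move(key), std::move(value));
        }
    }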

Rinkss commented on Jul 31, 2018

Serialization and deserialization of map<int, vector<...>> is very slow. I am passing this object to the archive, and it takes around 3 seconds to deserialize a 113 MB file.

Rinkss commented on Jul 31, 2018

A similar problem arises when I try using map<int, string>.
