Opened on Oct 18, 2022
In this run, a kolibri2zim over the full khan-academy in English crashed with
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
size[489062] == provider->getSize()[1226905]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_Z15_on_assert_failImmEvPKcS1_S1_T_T0_S1_i+0x1a9) [0x7f29e10d6c69]
/usr/local/lib/python3.8/site-packages/libzim.so.7(+0x197a44) [0x7f29e1103a44]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster13write_contentESt8functionIFvRKNS_4BlobEEE+0xde) [0x7f29e1103b2e]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZNK3zim6writer7Cluster5writeEi+0xec) [0x7f29e110430c]
/usr/local/lib/python3.8/site-packages/libzim.so.7(_ZN3zim6writer13clusterWriterEPv+0x111) [0x7f29e1106141]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f29e0ea3b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f29e558cfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f29e532eeff]
terminate called after throwing an instance of 'std::runtime_error'
what():
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:253
size[489062] == provider->getSize()[1226905]
This is due to this assert inside libzim's writer
void Cluster::write_data(writer_t writer) const
{
  for (auto& provider: m_providers)
  {
    ASSERT(provider->getSize(), !=, 0U);
    zim::size_type size = 0;
    while(true) {
      auto blob = provider->feed();
      if(blob.size() == 0) {
        break;
      }
      size += blob.size();
      writer(blob);
    }
    ASSERT(size, ==, provider->getSize());
  }
}
The code has been modified since (see https://github.com/openzim/libzim/blob/3a9f574d1aa2f722257f195fcdd6874e3517b8c6/src/writer/cluster.cpp#L246) and would now raise a RuntimeError exception instead, but the problem is the same: the size written to the ZIM is different from the size returned by the Provider's get_size().
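To make the failing invariant concrete, here is a minimal pure-Python sketch of the same check. It does not use libzim at all: `FakeProvider` and `write_provider` are hypothetical names made up for this illustration, and the sizes are the ones from the assertion message above (489062 bytes actually fed versus 1226905 declared).

```python
from typing import Callable, Iterable, Iterator


class FakeProvider:
    """Hypothetical stand-in for a content provider: a declared size plus chunks."""

    def __init__(self, declared_size: int, chunks: Iterable[bytes]):
        self.declared_size = declared_size
        self._chunks = list(chunks)

    def get_size(self) -> int:
        # Size announced up front (e.g. taken from an earlier size request).
        return self.declared_size

    def feed(self) -> Iterator[bytes]:
        # Bytes actually delivered, which may be fewer than announced.
        yield from self._chunks


def write_provider(provider: FakeProvider, writer: Callable[[bytes], None]) -> int:
    """Write every chunk, then compare bytes written with the declared size."""
    written = 0
    for chunk in provider.feed():
        if not chunk:
            break
        written += len(chunk)
        writer(chunk)
    if written != provider.get_size():
        # Same shape as the message in the crash log above.
        raise RuntimeError(
            f"size[{written}] == provider->getSize()[{provider.get_size()}]"
        )
    return written


def demo() -> str:
    # A provider that announces 1226905 bytes but only delivers 489062.
    truncated = FakeProvider(1226905, [b"x" * 489062])
    sink = []
    try:
        write_provider(truncated, sink.append)
    except RuntimeError as exc:
        return str(exc)
    return "ok"
```

The point of the sketch is that the mismatch can only be detected once the provider has been fully consumed, which in libzim happens on the cluster writer thread and not at the time the entry is added.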
Given that kolibri2zim only logs a debug message after an entry has been added to the creator, we don't know which Entry caused the issue.
My investigation points to a funneled file, as other types of content are added via string and their size is calculated automatically.
Funneled ones, on the other hand, are files that we download directly from the Studio into the ZIM using scraperlib's URLItem.
Looking at the KA DB, I found a single file reported to have the expected size: c142275210f3f6dec3dfbdb1d9836e7b.mp4.
It works as expected when tested individually, so my guess is that a network/server error caused the downloaded content to be different. Note that we make an initial tiny request to find the size and decide whether we need to download to disk or not.
We could re-run this and hope it fixes itself, but this sounds like it could happen again given the large size of the content.
Fixing this would be difficult though: the failure happens on a different, libzim-handled thread long after we've added the entry, so we can't catch the (libzim8+ only) exception and retry.
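One possible direction, sketched below under strong assumptions, is to verify the size on our side before the content is handed to the creator, since that is the last point where a retry is still possible. This is not how kolibri2zim or scraperlib currently work: `download_verified`, `SizeMismatch` and the `fetch` callable are hypothetical names for this illustration, and `fetch` stands in for whatever actually downloads the file.

```python
import time
from typing import Callable


class SizeMismatch(Exception):
    """Raised when no attempt returned the expected number of bytes."""


def download_verified(
    fetch: Callable[[], bytes],
    expected_size: int,
    attempts: int = 3,
    delay: float = 0.0,
) -> bytes:
    """Call fetch() until it returns exactly expected_size bytes.

    fetch is a hypothetical callable that performs one full download.
    Raises SizeMismatch if every attempt returns a different size.
    """
    last_size = -1
    for attempt in range(1, attempts + 1):
        data = fetch()
        if len(data) == expected_size:
            return data
        last_size = len(data)
        if attempt < attempts:
            # Wait before retrying; no sleep after the final attempt.
            time.sleep(delay)
    raise SizeMismatch(
        f"expected {expected_size} bytes, got {last_size} "
        f"after {attempts} attempts"
    )
```

The trade-off is that the whole file has to be downloaded (to memory here, or to disk for large videos) before it is added, instead of being streamed straight into the ZIM, so this gives up part of the benefit of funneling.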