-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-25839 Bulk Import fails with java.io.IOException: Type mismatch in value from map #3232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
offermannu
commented
May 5, 2021
- CellSortImporter emits values wrapped inside MapReduceExtendedCell
- CellWritableComparable serializes given cells using standard KeyValue serialization methods so that it fits the existing deserilaization method
See https://issues.apache.org/jira/browse/HBASE-25839 * CellSortImporter emits values wrapped inside MapReduceExtendedCell * CellWritableComparable serializes given cells using standard KeyValue serialization methods in order to fit with the used deserialization method
Update Import.java
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
out.writeInt(PrivateCellUtil.estimatedSerializedSizeOfKey(kv)); | ||
out.writeInt(0); | ||
PrivateCellUtil.writeFlatKey(kv, out); | ||
KeyValueUtil.write(new KeyValue(kv), out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PrivateCellUtil.writeFlatKey
writes the key value in a different format from KeyValueUtil.write
. This breaks compatibility, can we just keep writing in the same format as before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatever CellWritableComparable.write()
writes must be compatible with CellWritableComparable.readFields
which is currently not the case (see this exception in HBASE-25839. The phrase "keyLength=0" inside the exception message comes actually from the statement out.writeIn(0)
in line 139).
My proposal aligns the "write" with the existing CellWritableComparable.readFields
. Theoretically one can adjust CellWritableComparable.readFields
so that it becomes compatible with the current write
method but this looks more complicated to me.
AFAICS the key is only used during the mapper sorting phase (see stacktrace). The reducer doesn't care about the key at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are basically reverting changes from commit 0a24178 from 4 years ago (HBASE-18649). Can you get rid of that exception if you just remove the out.writeInt(0)
there? I'm not sure, though, on other potential side effects. Maybe we would also need to change Import.readFields
to stay aligned with HBASE-18649 (I guess we could use CellBuilder
to construct a Cell
directly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this, but it doesn't work unfortunately. If you look at the readFields
method you'll see that it calls KeyValue.create(DataInput in)
which in turn calls KeyValue.create(int length, DataInput in)
. The length is read as integer from the DataInput which kind of matches what write
has written before in line 138 - so far so good (I'm refering to original version btw).
Honestly I'm a little bit confused at this point and don't know whether length
expects the key length or the length of the whole key-value pair. You'll see that KeyValue
allocates a byte array of length length
(line 2266) and so I guess that it must be the length of key+value.
Now keep in mind that the first 4 bytes are consumed from DataInput (the length) and KeyValue.create
calls:
in.readFully(bytes); return new KeyValue(bytes, 0, length);
- thus bytes
starts with four '0' bytes coming from Import.write line 139: out.writeInt(0);
!
The KeyValue constructor in turn calls KeyValueUtil.checkKeyValueBytes
. And here I discovered that the serialization can never match the deserialization because KeyValueUtil.checkKeyValueBytes
expects the key len at the beginning of the bytes
buffer (note that offset==0
) but this is zero as mentioned before. The check finally results into the exception: Error: java.lang.IllegalArgumentException: Invalid key length in KeyValue. keyLength=0, KeyValueBytesHex=\x00\x00\x00\x00\x00Lc30f5d93[...], offset=0, length=97
To sum it up:
CellWritableComparable.write
writes:
[ [4 bytes key length], [4 bytes '0'], [...] ]
CellWritableComparable.readFields
expects:
[ [4bytes key-value length], [4 bytes key length], [...] ]
Honestly, I wonder how that ever worked ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so how about the other option I mentioned in my last comment:
Maybe we would also need to change Import.readFields to stay aligned with HBASE-18649 (I guess we could use CellBuilder to construct a Cell directly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the implementation enough to evaluate the alternatives. As far as I can tell CellWritableComparable.readFields
uses kind of standard deserialization methods while CellWritableComparable.write
implements its own serialization methods.
But before we find out what is the best way we should ask ourselfs if this feature is used at all. Since the code is broken for at least 4 years it seems to me that nobody uses the large import feature.
I ran into this problem because we had to import 2,5 billion records into a new HBase/Hadoop cluster. The direct import failed because the regionserver was flooded with write requests and we got lots of RegionTooBusyException
s. So we tried the bulk import and ran into this bugs which I tried to fix. Unfortunately it didn't work either because YARN produced too many cache files for local sorting and the disks ran full - but that's another story.
Finally we found a way to get the direct import working again which means I don't have a testing environment anymore.
IMHO if the feature is not used at all and there is no test available, it would be smartest to delete the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Patch in HBASE-18649 changed this part and made to write only the key part of the cell (rowkey + cf + q + ts + type) instead of the entire KV. So it writes key length
out.writeInt(PrivateCellUtil.estimatedSerializedSizeOfKey(kv));
And then value len as 0
and then key part.
But this is not compatible with counter part read where it expects a prefix int part which says the overall length of this KV (Ya what we wrote is still a KV , or call it Cell, where value is empty).
So I agree to the fact that this never worked.
We can not get rid of out.writeInt(0).
On the other hand we have to write this entire KV length what we are going to write. This is key len + value len(0) + 4 + 4 (we write these extra 4 bytes twice for these 2 lengths too)
int keyLen = CellUtil.estimatedSerializedSizeOfKey(kv);
int valueLen = 0; // We avoid writing value here. So just serialize as if an empty value.
out.writeInt(keyLen+ valueLen + KeyValue.KEYVALUE_INFRASTRUCTURE_SIZE);
out.writeInt(keyLen);
out.writeInt(valueLen);
CellUtil.writeFlatKey(kv, out);
Not sure whether we already have some util method for this. Else worth adding?
Above will work @offermannu ?