
HBASE-25839 Bulk Import fails with java.io.IOException: Type mismatch in value from map #3232


Open
wants to merge 2 commits into master

Conversation

offermannu

  1. CellSortImporter emits its values wrapped inside MapReduceExtendedCell.
  2. CellWritableComparable serializes the given cells using standard KeyValue serialization methods so that the output matches the existing deserialization method (a rough sketch follows below).
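
For illustration, a minimal sketch of what these two changes amount to. The class and helper methods here are made up for this example; MapReduceExtendedCell, KeyValue, and KeyValueUtil.write are the actual HBase types and calls referenced in this PR.

```java
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.KeyValueUtil;
import org.apache.hadoop.hbase.util.MapReduceExtendedCell;

// Hypothetical, simplified sketch of the two fixes described above.
final class CellSortImportSketch {

  // (1) Wrap the value in MapReduceExtendedCell before emitting it from the
  // mapper; presumably this makes the emitted value class match the job's
  // configured map output value class and avoids the
  // "Type mismatch in value from map" IOException.
  static Cell wrapForEmit(Cell cell) {
    return new MapReduceExtendedCell(cell);
  }

  // (2) Serialize with the standard KeyValue layout (total length prefix,
  // then key length, value length, key, value), which is what the existing
  // readFields side expects.
  static void write(Cell cell, DataOutput out) throws IOException {
    KeyValueUtil.write(new KeyValue(cell), out);
  }

  private CellSortImportSketch() {
  }
}
```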

offermannu added 2 commits May 5, 2021 11:22
See https://issues.apache.org/jira/browse/HBASE-25839
 * CellSortImporter emits values wrapped inside MapReduceExtendedCell
 * CellWritableComparable serializes given cells using standard KeyValue serialization methods so that they match the deserialization method in use
@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:---------:|:-------:|:--------|
| +0 🆗 | reexec | 0m 29s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 1s | No case conflicting files found. |
| +1 💚 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 3m 59s | master passed |
| +1 💚 | compile | 0m 50s | master passed |
| +1 💚 | checkstyle | 0m 21s | master passed |
| +1 💚 | spotbugs | 0m 47s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 3m 38s | the patch passed |
| +1 💚 | compile | 0m 48s | the patch passed |
| +1 💚 | javac | 0m 48s | the patch passed |
| -0 ⚠️ | checkstyle | 0m 18s | hbase-mapreduce: The patch generated 1 new + 15 unchanged - 0 fixed = 16 total (was 15) |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | hadoopcheck | 18m 1s | Patch does not cause any errors with Hadoop 3.1.2 3.2.1 3.3.0. |
| +1 💚 | spotbugs | 0m 55s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 38m 4s | |
| Subsystem | Report/Notes |
|:---------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #3232 |
| Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 8286e7369f46 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 90f9864 |
| Default Java | AdoptOpenJDK-1.8.0_282-b08 |
| checkstyle | https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/artifact/yetus-general-check/output/diff-checkstyle-hbase-mapreduce.txt |
| Max. process+thread count | 96 (vs. ulimit of 30000) |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/console |
| versions | git=2.17.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:---------:|:-------:|:--------|
| +0 🆗 | reexec | 1m 6s | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck |
| | | | _ Prechecks _ |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 3m 55s | master passed |
| +1 💚 | compile | 0m 27s | master passed |
| +1 💚 | shadedjars | 8m 4s | branch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 22s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 3m 42s | the patch passed |
| +1 💚 | compile | 0m 28s | the patch passed |
| +1 💚 | javac | 0m 28s | the patch passed |
| +1 💚 | shadedjars | 8m 10s | patch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 20s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | unit | 11m 27s | hbase-mapreduce in the patch passed. |
| | | 39m 18s | |
| Subsystem | Report/Notes |
|:---------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
| GITHUB PR | #3232 |
| Optional Tests | javac javadoc unit shadedjars compile |
| uname | Linux 866ba9bc1fbe 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 90f9864 |
| Default Java | AdoptOpenJDK-1.8.0_282-b08 |
| Test Results | https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/testReport/ |
| Max. process+thread count | 3528 (vs. ulimit of 30000) |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/console |
| versions | git=2.17.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:---------:|:-------:|:--------|
| +0 🆗 | reexec | 1m 13s | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck |
| | | | _ Prechecks _ |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 4m 59s | master passed |
| +1 💚 | compile | 0m 29s | master passed |
| +1 💚 | shadedjars | 9m 4s | branch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 22s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 4m 44s | the patch passed |
| +1 💚 | compile | 0m 29s | the patch passed |
| +1 💚 | javac | 0m 29s | the patch passed |
| +1 💚 | shadedjars | 9m 6s | patch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 20s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | unit | 14m 35s | hbase-mapreduce in the patch passed. |
| | | 46m 38s | |
| Subsystem | Report/Notes |
|:---------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
| GITHUB PR | #3232 |
| Optional Tests | javac javadoc unit shadedjars compile |
| uname | Linux 4d79fc4a9bf5 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 90f9864 |
| Default Java | AdoptOpenJDK-11.0.10+9 |
| Test Results | https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/testReport/ |
| Max. process+thread count | 2798 (vs. ulimit of 30000) |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3232/1/console |
| versions | git=2.17.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@wchevreuil changed the title from "Fix HBASE-25839 Bulk Import fails with java.io.IOException: Type mismatch in value from map" to "HBASE-25839 Bulk Import fails with java.io.IOException: Type mismatch in value from map" on May 5, 2021
```diff
- out.writeInt(PrivateCellUtil.estimatedSerializedSizeOfKey(kv));
- out.writeInt(0);
- PrivateCellUtil.writeFlatKey(kv, out);
+ KeyValueUtil.write(new KeyValue(kv), out);
```
Contributor


PrivateCellUtil.writeFlatKey writes the key value in a different format from KeyValueUtil.write. This breaks compatibility; can we just keep writing in the same format as before?

Author

@offermannu offermannu May 5, 2021


Whatever CellWritableComparable.write() writes must be compatible with CellWritableComparable.readFields, which is currently not the case (see the exception in HBASE-25839; the phrase "keyLength=0" in the exception message actually comes from the statement out.writeInt(0) in line 139).

My proposal aligns the write with the existing CellWritableComparable.readFields. Theoretically one could adjust CellWritableComparable.readFields so that it becomes compatible with the current write method, but that looks more complicated to me.

AFAICS the key is only used during the mapper sorting phase (see stacktrace). The reducer doesn't care about the key at all.

Contributor


You are basically reverting changes from commit 0a24178 from 4 years ago (HBASE-18649). Could you get rid of that exception by just removing the out.writeInt(0) there? I'm not sure, though, about other potential side effects. Maybe we would also need to change Import.readFields to stay aligned with HBASE-18649 (I guess we could use CellBuilder to construct a Cell directly).
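
For reference, a rough sketch of that CellBuilder alternative. The helper class and method below are made up for illustration, and the part where readFields would parse the individual key fields out of the serialized input is deliberately left out; only the CellBuilderFactory/CellBuilderType/CellBuilder API is real.

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellBuilderFactory;
import org.apache.hadoop.hbase.CellBuilderType;

// Hypothetical helper: builds a Cell directly from already-parsed key fields
// instead of going through KeyValue.create.
final class CellFromFieldsSketch {

  static Cell build(byte[] row, byte[] family, byte[] qualifier, long timestamp) {
    return CellBuilderFactory.create(CellBuilderType.DEEP_COPY)
        .setRow(row)
        .setFamily(family)
        .setQualifier(qualifier)
        .setTimestamp(timestamp)
        .setType(Cell.Type.Put)
        .build();
  }

  private CellFromFieldsSketch() {
  }
}
```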

Author

@offermannu offermannu May 7, 2021


I tried this, but unfortunately it doesn't work. If you look at the readFields method you'll see that it calls KeyValue.create(DataInput in), which in turn calls KeyValue.create(int length, DataInput in). The length is read as an integer from the DataInput, which kind of matches what write has written before in line 138 - so far so good (I'm referring to the original version, btw).
Honestly, I'm a little confused at this point and don't know whether length is meant to be the key length or the length of the whole key-value pair. You'll see that KeyValue allocates a byte array of length length (line 2266), so I guess it must be the length of key+value.

Now keep in mind that the first 4 bytes (the length) are consumed from the DataInput, and KeyValue.create then calls:
in.readFully(bytes); return new KeyValue(bytes, 0, length); - thus bytes starts with four '0' bytes coming from Import.write line 139: out.writeInt(0)!

The KeyValue constructor in turn calls KeyValueUtil.checkKeyValueBytes. And here I discovered that the serialization can never match the deserialization, because KeyValueUtil.checkKeyValueBytes expects the key length at the beginning of the bytes buffer (note that offset==0), but this is zero, as mentioned before. The check finally results in the exception: Error: java.lang.IllegalArgumentException: Invalid key length in KeyValue. keyLength=0, KeyValueBytesHex=\x00\x00\x00\x00\x00Lc30f5d93[...], offset=0, length=97
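
For illustration, a small sketch of that read path as described above. This is not the actual HBase code, just a simplified model of why the zero written by write() ends up being checked as the key length.

```java
import java.io.DataInput;
import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;

// Illustration only, modeled on the behaviour described above.
final class ReadPathSketch {

  static KeyValue readCell(DataInput in) throws IOException {
    int length = in.readInt();        // interpreted as the total key-value length
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    // The KeyValue constructor validates the buffer (checkKeyValueBytes) and
    // expects the first 4 bytes at offset 0 to be the key length. With the old
    // write() those bytes are the literal zero from out.writeInt(0), hence
    // "Invalid key length in KeyValue. keyLength=0".
    return new KeyValue(bytes, 0, length);
  }

  private ReadPathSketch() {
  }
}
```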

To sum it up:
CellWritableComparable.write writes:
[ [4 bytes key length], [4 bytes '0'], [...] ]

CellWritableComparable.readFields expects:
[ [4 bytes key-value length], [4 bytes key length], [...] ]

Honestly, I wonder how that ever worked ;-)

Contributor


Ok, so how about the other option I mentioned in my last comment:

> Maybe we would also need to change Import.readFields to stay aligned with HBASE-18649 (I guess we could use CellBuilder to construct a Cell directly).

Author


I don't know the implementation well enough to evaluate the alternatives. As far as I can tell, CellWritableComparable.readFields uses more or less standard deserialization methods, while CellWritableComparable.write implements its own serialization.

But before we work out the best way, we should ask ourselves whether this feature is used at all. Since the code has been broken for at least 4 years, it seems to me that nobody uses the large import feature.

I ran into this problem because we had to import 2.5 billion records into a new HBase/Hadoop cluster. The direct import failed because the region server was flooded with write requests and we got lots of RegionTooBusyExceptions. So we tried the bulk import and ran into this bug, which I tried to fix. Unfortunately it didn't work either, because YARN produced too many cache files for local sorting and the disks filled up - but that's another story.

Finally we found a way to get the direct import working again, which means I don't have a testing environment anymore.

IMHO, if the feature is not used at all and no test is available, it would be smartest to delete the feature.

Contributor

@anoopsjohn anoopsjohn May 30, 2021


The patch in HBASE-18649 changed this part to write only the key part of the cell (rowkey + cf + q + ts + type) instead of the entire KV. So it writes the key length
out.writeInt(PrivateCellUtil.estimatedSerializedSizeOfKey(kv));
then the value length as 0, and then the key part.
But this is not compatible with the counterpart read, which expects a prefixed int giving the overall length of this KV (ya, what we wrote is still a KV, or call it a Cell, where the value is empty).
So I agree that this never worked.
We cannot get rid of out.writeInt(0).
On the other hand, we have to write the length of the entire KV that we are going to write. This is key len + value len (0) + 4 + 4 (we write these extra 4 bytes twice for the two length ints too):

```java
int keyLen = CellUtil.estimatedSerializedSizeOfKey(kv);
int valueLen = 0; // We avoid writing value here. So just serialize as if an empty value.
out.writeInt(keyLen + valueLen + KeyValue.KEYVALUE_INFRASTRUCTURE_SIZE);
out.writeInt(keyLen);
out.writeInt(valueLen);
CellUtil.writeFlatKey(kv, out);
```

Not sure whether we already have some util method for this. If not, is it worth adding?

Will the above work, @offermannu?
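
In case no such util exists yet, a minimal sketch of what it could look like. The class and method names below are made up (not an existing HBase API); the body simply mirrors the snippet above.

```java
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.PrivateCellUtil;

final class KeyOnlyWriteSketch {

  // Writes the cell as a value-less KeyValue: total length prefix first, then
  // key length, value length (0), and the flat key, matching the layout the
  // read side expects.
  static void writeKeyOnlyAsKeyValue(Cell kv, DataOutput out) throws IOException {
    int keyLen = PrivateCellUtil.estimatedSerializedSizeOfKey(kv);
    int valueLen = 0; // the value is deliberately not serialized
    out.writeInt(keyLen + valueLen + KeyValue.KEYVALUE_INFRASTRUCTURE_SIZE);
    out.writeInt(keyLen);
    out.writeInt(valueLen);
    PrivateCellUtil.writeFlatKey(kv, out);
  }

  private KeyOnlyWriteSketch() {
  }
}
```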
