
[CARBONDATA-458]Improving First time query performance #265

Conversation

@kumarvishal09 (Contributor) commented Oct 27, 2016

Improving Carbon first-time query performance

Reason:

  1. When the file system cache is cleared, file reads hit the disk, which makes the first read and cache population slower.
  2. For the first query, Carbon has to read the footer of every data file to build the B-tree.
  3. Carbon reads more footer data than it needs (the full data chunk).
  4. Many random seeks happen in Carbon because a column's data (data page, RLE, inverted index) is not stored together.

Solution:

  1. Improve block loading time by removing the data chunk from BlockletInfo and storing only the offset and length of each data chunk.
  2. Compress the presence-meta bitset (which marks null values for measure columns) using Snappy; see the sketch after this list.
  3. Store a column's metadata and data together and read them together; this reduces random seeks and improves IO.
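Solution item 2 can be illustrated with a minimal standalone sketch, assuming the snappy-java library (org.xerial.snappy); this is illustrative only and not the PR's actual writer/reader code:

import java.util.BitSet;
import org.xerial.snappy.Snappy;

// Minimal sketch: compress the presence-meta bitset (positions of null values
// for a measure column) with Snappy and verify it round-trips.
public final class PresenceMetaCompressionSketch {
  public static void main(String[] args) throws Exception {
    BitSet presenceMeta = new BitSet(32000);
    presenceMeta.set(10);      // example: rows 10 and 25000 contain nulls
    presenceMeta.set(25000);

    byte[] raw = presenceMeta.toByteArray();
    byte[] compressed = Snappy.compress(raw);        // what would be written to the file
    byte[] restored = Snappy.uncompress(compressed); // read path: decompress before use
    BitSet roundTrip = BitSet.valueOf(restored);

    System.out.println("raw=" + raw.length + "B, compressed=" + compressed.length
        + "B, matches=" + roundTrip.equals(presenceMeta));
  }
}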

Note
Since there is a change in schema.thrift, the carbon-format jar has to be rebuilt and deployed.

@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch 3 times, most recently from 25b1653 to ccf00b8, on October 28, 2016 05:43
@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch 6 times, most recently from dc1ff07 to eacc8d4, on November 29, 2016 17:04
@kumarvishal09 changed the title from [WIP]Improve first time query performance to [CARBONDATA-458]Improve first time query performance on Nov 29, 2016
@kumarvishal09 changed the title from [CARBONDATA-458]Improve first time query performance to [CARBONDATA-458]Improving First time query performance on Nov 29, 2016

@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch from eacc8d4 to 668de2d on November 29, 2016 17:30
@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch from 668de2d to 1d1b4da on November 30, 2016 08:38
@@ -74,7 +74,7 @@
* @param blockIndexes indexes of the blocks need to be read
Contributor

Can you add more description for blockIndexes? It is a two-dimensional array; what does each dimension mean?

Contributor Author

ok

this.filePath = FileFactory.getUpdatedFilePath(filePath);
this.blockOffset = blockOffset;
this.segmentId = segmentId;
this.locations = locations;
this.blockLength = blockLength;
this.blockletInfos = blockletInfos;
this.version = version;
Contributor

The validity of the version should be checked somewhere. Where is it checked?

Contributor Author

The version is validated in the validateCarbonDataFileVersion method of CarbonProperties.

public DimensionColumnChunkReader getDimensionColumnChunkReader(short version,
BlockletInfo blockletInfo, int[] eachColumnValueSize, String filePath) {
switch (version) {
case 2:
Contributor

There should be an enum for the version.
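For illustration, such a version enum might look like the following minimal sketch (the enum name, values, and helper method are assumptions, not the PR's final code):

// Hypothetical version enum; names and values are illustrative only.
public enum ColumnarFormatVersion {
  V1((short) 1),
  V2((short) 2);

  private final short number;

  ColumnarFormatVersion(short number) {
    this.number = number;
  }

  public short number() {
    return number;
  }

  // Map a raw version number read from the file footer to the enum,
  // rejecting anything the reader does not understand.
  public static ColumnarFormatVersion fromNumber(short number) {
    for (ColumnarFormatVersion v : values()) {
      if (v.number == number) {
        return v;
      }
    }
    throw new IllegalArgumentException("Unsupported carbondata file version: " + number);
  }
}

The switch in getDimensionColumnChunkReader could then dispatch on the enum instead of a raw short.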

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.carbondata.core.carbon.datastore.chunk.reader.dimension;
Contributor

Can we create a separate package for the V2 reader, so that the V1 and V2 packages are separated? I think that would be clearer.

Contributor Author

OK, I will move and rename all the classes as per your suggestion.

/**
* Compressed dimension chunk reader class for version 2
*/
public class CompressedDimensionChunkFileBasedReader2 extends AbstractChunkReader {
Contributor

rename to CompressedDimensionChunkFileBasedReaderV2

System.arraycopy(data, copySourcePoint, compressedIndexPage, 0,
dimensionColumnChunk.rowid_page_length);
copySourcePoint += dimensionColumnChunk.rowid_page_length;
invertedIndexes = CarbonUtil
Contributor

Move CarbonUtil to the next line and break at the parameters.

Contributor Author

ok

import org.apache.carbondata.core.carbon.metadata.blocklet.datachunk.DataChunk;
import org.apache.carbondata.core.datastorage.store.compression.ValueCompressionModel;
import org.apache.carbondata.core.datastorage.store.compression.ValueCompressonHolder;
import org.apache.carbondata.core.datastorage.store.compression.ValueCompressonHolder.UnCompressValue;

/**
* Measure block reader abstract class
*/
public abstract class AbstractMeasureChunkReader implements MeasureColumnChunkReader {
Contributor

It seems this class does not implement any interface method of MeasureColumnChunkReader.

Contributor Author

Currently it does not implement any method, but in the future it may hold some common methods.

/**
* Class to read the measure column data for version 2
*/
public class CompressedMeasureChunkFileReader2 extends AbstractMeasureChunkReader {
Contributor

Please rename all V2 classes so that their names end with V2.

Contributor Author

ok

byte[] data = null;
byte[] dataPage = null;
if (measureColumnChunkOffsets.size() - 1 == blockIndex) {
measureDataChunk = fileReader
Contributor

move fileReader to next line

Contributor Author

ok

@@ -38,7 +38,7 @@
/**
* version used for data compatibility
*/
private int versionId;
private short versionId;
Contributor

Why was the version changed from int to short?

Contributor Author

The short range is enough to store the version. Please let me know if storing it as an int is required.

/**
* current data file version
*/
public static final short CARBON_DATA_FILE_CURRENT_VERSION = 2;
Contributor

It is not easy to understand what "current" means. Can you change it to a more meaningful name, and use an enum for the version number?
Is the default version 1 or 2?

Contributor Author

OK, I will change it to a more meaningful name.

Contributor

Change the 2 to the enum as well.

/**
* If the level 2 compaction is done in minor then new compacted segment will end with .2
*/
public static String LEVEL2_COMPACTION_INDEX = ".2";
Contributor

Why is this needed in this PR? Isn't it for compaction?

Contributor Author

I checked it in by mistake; I will remove this constant.

case DIRECT_DICTIONARY:
return Encoding.DIRECT_DICTIONARY;
default:
return Encoding.DICTIONARY;
Contributor

Is this really the default, or should we throw an exception?

Contributor Author

Ok, I will fix
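For illustration, a minimal standalone sketch of the suggested behaviour, failing fast instead of silently falling back to DICTIONARY (the enum and method names here are simplified assumptions, not the PR's actual conversion code):

// Hypothetical, simplified encoding mapping used only to illustrate the point.
enum Encoding { DICTIONARY, DIRECT_DICTIONARY }

final class EncodingMappingSketch {
  // Unknown encodings now raise an error instead of being treated as DICTIONARY.
  static Encoding toInternalEncoding(String externalName) {
    switch (externalName) {
      case "DICTIONARY":
        return Encoding.DICTIONARY;
      case "DIRECT_DICTIONARY":
        return Encoding.DIRECT_DICTIONARY;
      default:
        throw new IllegalArgumentException("Unsupported encoding: " + externalName);
    }
  }

  public static void main(String[] args) {
    System.out.println(toInternalEncoding("DIRECT_DICTIONARY")); // prints DIRECT_DICTIONARY
    System.out.println(toInternalEncoding("DELTA"));             // throws IllegalArgumentException
  }
}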

@jackylk (Contributor) commented Nov 30, 2016

Please add test cases for the following (a minimal sketch of switching the format version in a test follows this list):

  1. write a V2 file and read it
  2. write a V1 file and read it
  3. read an existing V1 file
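For illustration, such a test could switch the data file format version through CarbonProperties before each load; a minimal sketch, assuming the CarbonProperties API and the CARBON_DATA_FILE_VERSION constant referenced in this PR (import paths and the omitted load/query steps are assumptions):

// Sketch only: toggling the carbondata file version for write-and-read tests.
// Import paths are assumed for this version of the code base.
import org.apache.carbondata.core.constants.CarbonCommonConstants;
import org.apache.carbondata.core.util.CarbonProperties;

public final class VersionSwitchTestSketch {
  public static void main(String[] args) {
    // Case 2: force V1, then load a table and query it (load/query omitted here).
    CarbonProperties.getInstance()
        .addProperty(CarbonCommonConstants.CARBON_DATA_FILE_VERSION, "1");
    // ... load data and run queries against the V1 files ...

    // Case 1: repeat with V2, the current version constant in this PR.
    CarbonProperties.getInstance()
        .addProperty(CarbonCommonConstants.CARBON_DATA_FILE_VERSION, "2");
    // ... load data and run the same queries against the V2 files ...

    // Case 3 (read an existing V1 file) needs pre-generated V1 data and only a query step.
  }
}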

@kumarvishal09 (Contributor Author)

@jackylk all the existing test cases run with V2, and I have updated some of them to use V1. OK, I will add test cases for reading existing V1 files.

* if parameter is invalid current version will be set
*/
private void validateCarbonDataFileVersion() {
short carbondataFileVersion = CarbonCommonConstants.CARBON_DATA_FILE_CURRENT_VERSION;
Contributor

this initial value is of no use

.setProperty(CarbonCommonConstants.CARBON_DATA_FILE_VERSION, carbondataFileVersion + "");
}
if (carbondataFileVersion > CarbonCommonConstants.CARBON_DATA_FILE_CURRENT_VERSION
|| carbondataFileVersion < 0) {
Contributor

This check is not correct: we should accept versions 1 and 2, but not 0.
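For illustration, the suggested range check, accepting only versions from 1 up to the current version, as a minimal standalone sketch (not the PR's actual code):

// Sketch of the suggested validity check: a carbondata file version is valid
// only if it lies between 1 and the current (latest) version, inclusive.
public final class VersionRangeCheckSketch {
  static final short CURRENT_VERSION = 2;

  static boolean isValidVersion(short version) {
    return version >= 1 && version <= CURRENT_VERSION;
  }

  public static void main(String[] args) {
    for (short v : new short[] {0, 1, 2, 3}) {
      System.out.println("version " + v + " valid? " + isValidVersion(v));
    }
  }
}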

configuration.set(CarbonInputFormat.INPUT_SEGMENT_NUMBERS,
CarbonUtil.getSegmentString(validSegments));
configuration
.set(CarbonInputFormat.INPUT_SEGMENT_NUMBERS, CarbonUtil.getSegmentString(validSegments));
Contributor

break the line at parameter

.set(CarbonInputFormat.INPUT_SEGMENT_NUMBERS, CarbonUtil.getSegmentString(validSegments));
}

private static AbsoluteTableIdentifier getAbsoluteTableIdentifier(Configuration configuration) {
Contributor

rename to getIdentifier to make it shorter

@@ -193,8 +203,7 @@ public static void setSegmentsToAccess(Configuration configuration, List<String>
* @return List<InputSplit> list of CarbonInputSplit
* @throws IOException
*/
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
@Override public List<InputSplit> getSplits(JobContext job) throws IOException {
Contributor

move Override to previous line

@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch 2 times, most recently from e1ca8c8 to c7a6bc1, on December 1, 2016 09:09
@kumarvishal09 (Contributor Author)

Note
Since there is a change in schema.thrift, the carbon-format jar has to be rebuilt and deployed.

@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch from c7a6bc1 to 7a913d9 on December 1, 2016 09:21
@jackylk (Contributor) commented Dec 1, 2016

LGTM

@asfgit closed this in 7213ac0 on Dec 1, 2016