
[CARBONDATA-458]Improving First time query performance #265

Conversation

@kumarvishal09 (Contributor) commented Oct 27, 2016

Improving Carbon first-time query performance

Reason:

  1. When the file system cache is cleared, file reads hit the disk, which makes the first read and cache population slower.
  2. For the first query, Carbon has to read the footer of every data file to build the B-tree.
  3. Carbon reads more footer data than it needs (the full data chunk).
  4. Many random seeks happen in Carbon because a column's data (data page, RLE, inverted index) is not stored together.

Solution:

  1. Improve block loading time by removing the data chunk from BlockletInfo and storing only the offset and length of each data chunk.
  2. Compress the presence-meta bitset (which marks null values for measure columns) using Snappy; see the sketch after this list.
  3. Store a column's metadata and data together and read them together; this reduces random seeks and improves IO.
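Solution item 2 can be illustrated with a minimal standalone sketch, assuming the snappy-java library (org.xerial.snappy); this is illustrative only and not the PR's actual writer/reader code:

import java.util.BitSet;
import org.xerial.snappy.Snappy;

// Minimal sketch: compress the presence-meta bitset (positions of null values
// for a measure column) with Snappy and verify it round-trips.
public final class PresenceMetaCompressionSketch {
  public static void main(String[] args) throws Exception {
    BitSet presenceMeta = new BitSet(32000);
    presenceMeta.set(10);      // example: rows 10 and 25000 contain nulls
    presenceMeta.set(25000);

    byte[] raw = presenceMeta.toByteArray();
    byte[] compressed = Snappy.compress(raw);        // what would be written to the file
    byte[] restored = Snappy.uncompress(compressed); // read path: decompress before use
    BitSet roundTrip = BitSet.valueOf(restored);

    System.out.println("raw=" + raw.length + "B, compressed=" + compressed.length
        + "B, matches=" + roundTrip.equals(presenceMeta));
  }
}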

Note
Since there is a change in schema.thrift, the carbon-format jar has to be rebuilt and deployed.

@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch 3 times, most recently from 25b1653 to ccf00b8, on October 28, 2016 05:43
@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch 6 times, most recently from dc1ff07 to eacc8d4, on November 29, 2016 17:04
@kumarvishal09 changed the title from [WIP]Improve first time query performance to [CARBONDATA-458]Improve first time query performance on Nov 29, 2016
@kumarvishal09 changed the title from [CARBONDATA-458]Improve first time query performance to [CARBONDATA-458]Improving First time query performance on Nov 29, 2016

@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch from eacc8d4 to 668de2d on November 29, 2016 17:30
@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch from 668de2d to 1d1b4da on November 30, 2016 08:38
@@ -74,7 +74,7 @@
* @param blockIndexes indexes of the blocks need to be read
Contributor

Can you add more description for blockIndexes? It is a two-dimensional array; what does each dimension mean?

Contributor Author

ok

this.filePath = FileFactory.getUpdatedFilePath(filePath);
this.blockOffset = blockOffset;
this.segmentId = segmentId;
this.locations = locations;
this.blockLength = blockLength;
this.blockletInfos = blockletInfos;
this.version = version;
Contributor

The validity of the version should be checked somewhere. Where is it checked?

Contributor Author

The version is validated in the validateCarbonDataFileVersion method of CarbonProperties.

public DimensionColumnChunkReader getDimensionColumnChunkReader(short version,
BlockletInfo blockletInfo, int[] eachColumnValueSize, String filePath) {
switch (version) {
case 2:
Contributor

There should be an enum for the version.
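For illustration, such a version enum might look like the following minimal sketch (the enum name, values, and helper method are assumptions, not the PR's final code):

// Hypothetical version enum; names and values are illustrative only.
public enum ColumnarFormatVersion {
  V1((short) 1),
  V2((short) 2);

  private final short number;

  ColumnarFormatVersion(short number) {
    this.number = number;
  }

  public short number() {
    return number;
  }

  // Map a raw version number read from the file footer to the enum,
  // rejecting anything the reader does not understand.
  public static ColumnarFormatVersion fromNumber(short number) {
    for (ColumnarFormatVersion v : values()) {
      if (v.number == number) {
        return v;
      }
    }
    throw new IllegalArgumentException("Unsupported carbondata file version: " + number);
  }
}

The switch in getDimensionColumnChunkReader could then dispatch on the enum instead of a raw short.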

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.carbondata.core.carbon.datastore.chunk.reader.dimension;
Contributor

Can we create a separate package for the V2 reader, so that the V1 and V2 packages are separated? I think that would be clearer.

Contributor Author

OK, I will move and rename all the classes as per your suggestion.

/**
* Compressed dimension chunk reader class for version 2
*/
public class CompressedDimensionChunkFileBasedReader2 extends AbstractChunkReader {
Contributor

rename to CompressedDimensionChunkFileBasedReaderV2

System.arraycopy(data, copySourcePoint, compressedIndexPage, 0,
dimensionColumnChunk.rowid_page_length);
copySourcePoint += dimensionColumnChunk.rowid_page_length;
invertedIndexes = CarbonUtil
Contributor

Move CarbonUtil to the next line and break at the parameters.

Contributor Author

ok

import org.apache.carbondata.core.carbon.metadata.blocklet.datachunk.DataChunk;
import org.apache.carbondata.core.datastorage.store.compression.ValueCompressionModel;
import org.apache.carbondata.core.datastorage.store.compression.ValueCompressonHolder;
import org.apache.carbondata.core.datastorage.store.compression.ValueCompressonHolder.UnCompressValue;

/**
* Measure block reader abstract class
*/
public abstract class AbstractMeasureChunkReader implements MeasureColumnChunkReader {
Contributor

It seems this class does not implement any interface method of MeasureColumnChunkReader.

Contributor Author

Currently it does not implement any method, but in the future it may hold some common methods.

/**
* Class to read the measure column data for version 2
*/
public class CompressedMeasureChunkFileReader2 extends AbstractMeasureChunkReader {
Contributor

Please rename all V2 classes so that their names end with V2.

Contributor Author

ok

byte[] data = null;
byte[] dataPage = null;
if (measureColumnChunkOffsets.size() - 1 == blockIndex) {
measureDataChunk = fileReader
Contributor

move fileReader to next line

Contributor Author

ok

@@ -38,7 +38,7 @@
/**
* version used for data compatibility
*/
private int versionId;
private short versionId;
Contributor

Why was the version changed from int to short?

Contributor Author

The short range is enough to store the version. Please let me know if storing it as an int is required.

/**
* current data file version
*/
public static final short CARBON_DATA_FILE_CURRENT_VERSION = 2;
Contributor

It is not easy to understand what "current" means. Can you change it to a more meaningful name, and use an enum for the version number?
Is the default version 1 or 2?

Contributor Author

OK, I will change it to a more meaningful name.

Contributor

Change the 2 to the enum as well.

/**
* If the level 2 compaction is done in minor then new compacted segment will end with .2
*/
public static String LEVEL2_COMPACTION_INDEX = ".2";
Contributor

Why is this needed in this PR? Isn't it for compaction?

Contributor Author

I checked it in by mistake; I will remove this constant.

case DIRECT_DICTIONARY:
return Encoding.DIRECT_DICTIONARY;
default:
return Encoding.DICTIONARY;
Contributor

Is this really the default, or should we throw an exception?

Contributor Author

Ok, I will fix
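For illustration, a minimal standalone sketch of the suggested behaviour, failing fast instead of silently falling back to DICTIONARY (the enum and method names here are simplified assumptions, not the PR's actual conversion code):

// Hypothetical, simplified encoding mapping used only to illustrate the point.
enum Encoding { DICTIONARY, DIRECT_DICTIONARY }

final class EncodingMappingSketch {
  // Unknown encodings now raise an error instead of being treated as DICTIONARY.
  static Encoding toInternalEncoding(String externalName) {
    switch (externalName) {
      case "DICTIONARY":
        return Encoding.DICTIONARY;
      case "DIRECT_DICTIONARY":
        return Encoding.DIRECT_DICTIONARY;
      default:
        throw new IllegalArgumentException("Unsupported encoding: " + externalName);
    }
  }

  public static void main(String[] args) {
    System.out.println(toInternalEncoding("DIRECT_DICTIONARY")); // prints DIRECT_DICTIONARY
    System.out.println(toInternalEncoding("DELTA"));             // throws IllegalArgumentException
  }
}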

@jackylk (Contributor) commented Nov 30, 2016

Please add test cases for the following (a minimal sketch of switching the format version in a test follows this list):

  1. write a V2 file and read it
  2. write a V1 file and read it
  3. read an existing V1 file
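For illustration, such a test could switch the data file format version through CarbonProperties before each load; a minimal sketch, assuming the CarbonProperties API and the CARBON_DATA_FILE_VERSION constant referenced in this PR (import paths and the omitted load/query steps are assumptions):

// Sketch only: toggling the carbondata file version for write-and-read tests.
// Import paths are assumed for this version of the code base.
import org.apache.carbondata.core.constants.CarbonCommonConstants;
import org.apache.carbondata.core.util.CarbonProperties;

public final class VersionSwitchTestSketch {
  public static void main(String[] args) {
    // Case 2: force V1, then load a table and query it (load/query omitted here).
    CarbonProperties.getInstance()
        .addProperty(CarbonCommonConstants.CARBON_DATA_FILE_VERSION, "1");
    // ... load data and run queries against the V1 files ...

    // Case 1: repeat with V2, the current version constant in this PR.
    CarbonProperties.getInstance()
        .addProperty(CarbonCommonConstants.CARBON_DATA_FILE_VERSION, "2");
    // ... load data and run the same queries against the V2 files ...

    // Case 3 (read an existing V1 file) needs pre-generated V1 data and only a query step.
  }
}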

@kumarvishal09 (Contributor Author)

@jackylk all the existing test cases run with V2, and I have updated some of them to use V1. OK, I will add test cases for reading existing V1 files.

* if parameter is invalid current version will be set
*/
private void validateCarbonDataFileVersion() {
short carbondataFileVersion = CarbonCommonConstants.CARBON_DATA_FILE_CURRENT_VERSION;
Contributor

this initial value is of no use

.setProperty(CarbonCommonConstants.CARBON_DATA_FILE_VERSION, carbondataFileVersion + "");
}
if (carbondataFileVersion > CarbonCommonConstants.CARBON_DATA_FILE_CURRENT_VERSION
|| carbondataFileVersion < 0) {
Contributor

This check is not correct: we should accept versions 1 and 2, but not 0.
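For illustration, the suggested range check, accepting only versions from 1 up to the current version, as a minimal standalone sketch (not the PR's actual code):

// Sketch of the suggested validity check: a carbondata file version is valid
// only if it lies between 1 and the current (latest) version, inclusive.
public final class VersionRangeCheckSketch {
  static final short CURRENT_VERSION = 2;

  static boolean isValidVersion(short version) {
    return version >= 1 && version <= CURRENT_VERSION;
  }

  public static void main(String[] args) {
    for (short v : new short[] {0, 1, 2, 3}) {
      System.out.println("version " + v + " valid? " + isValidVersion(v));
    }
  }
}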

configuration.set(CarbonInputFormat.INPUT_SEGMENT_NUMBERS,
CarbonUtil.getSegmentString(validSegments));
configuration
.set(CarbonInputFormat.INPUT_SEGMENT_NUMBERS, CarbonUtil.getSegmentString(validSegments));
Contributor

break the line at parameter

.set(CarbonInputFormat.INPUT_SEGMENT_NUMBERS, CarbonUtil.getSegmentString(validSegments));
}

private static AbsoluteTableIdentifier getAbsoluteTableIdentifier(Configuration configuration) {
Contributor

rename to getIdentifier to make it shorter

@@ -193,8 +203,7 @@ public static void setSegmentsToAccess(Configuration configuration, List<String>
* @return List<InputSplit> list of CarbonInputSplit
* @throws IOException
*/
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
@Override public List<InputSplit> getSplits(JobContext job) throws IOException {
Contributor

move Override to previous line

@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch 2 times, most recently from e1ca8c8 to c7a6bc1, on December 1, 2016 09:09
@kumarvishal09 (Contributor Author)

Note
Since there is a change in schema.thrift, the carbon-format jar has to be rebuilt and deployed.

@kumarvishal09 force-pushed the FirstTimeQueryPerformanceImprovement branch from c7a6bc1 to 7a913d9 on December 1, 2016 09:21
@jackylk (Contributor) commented Dec 1, 2016

LGTM

@asfgit closed this in 7213ac0 on Dec 1, 2016