extracting only a part of the tea file #28

Open
pavlexander opened this issue Oct 17, 2023 · 2 comments

pavlexander commented Oct 17, 2023

I would like to select/read only a set of data from file based on criteria. Is there an optimal approach for doing it?

attempt 1

model

    public struct CandleInDbNew
    {
        public uint OpenTs;

        public decimal OpenPrice;
        public decimal HighPrice;
        public decimal LowPrice;
        public decimal ClosePrice;

        public uint TradeCount;

        public decimal Volume;
        public decimal QuoteAssetVolume;
        public decimal TakerBuyBaseAssetVolume;
        public decimal TakerBuyQuoteAssetVolume;
    }

method

        public List<CandleInDbNew> GetCandlesInRange(
            string fileFullPath,
            uint from)
        {
            var result = new List<CandleInDbNew>();

            if (!File.Exists(fileFullPath))
            {
                return result;
            }

            using (var tf = TeaFile<CandleInDbNew>.OpenRead(fileFullPath,
                    ItemDescriptionElements.FieldNames |
                    ItemDescriptionElements.FieldTypes |
                    ItemDescriptionElements.FieldOffsets |
                    ItemDescriptionElements.ItemSize))
            {
                foreach (var item in tf.Items)
                {
                    if (item.OpenTs >= from)
                        result.Add(item);
                }
            }

            return result;
        }

Given that the data in my file is sorted by OpenTs, I would like to filter out the values that are outside a specific range, as in the example above.

issue

This approach is inefficient because every item in the file is read and mapped to the struct before it is filtered. It is slow and does not solve the problem.

attempt 2

I have also tried the unmapped (untyped) approach, but an exception is thrown on read:

System.IO.IOException: 'Decimal constructor requires an array or span of four valid decimal bytes.'

I have managed to extract part of the data that causes the issue. https://github.com/pavlexander/testfile/blob/main/ETHBTC_big.7z

There were no issues with 10k, 50k, or 100k records, but at 1 million records I started getting the error. Please download and unpack the file, then use the following code to reproduce it:

            var result = new List<CandleInDbNew>();

            using (var tf = TeaFile.OpenRead("ETHBTC_big.tea")) // exception here
            {
                var openTsColumn = tf.Description.ItemDescription.GetFieldByName("OpenTs");

                foreach (Item item in tf.Items)
                {
                    var openTs = (uint)openTsColumn.GetValue(item);

                    if (openTs >= 1692190740)
                        result.Add(default); // temporary
                }
            }

issue

Even if this solution worked, there is no guarantee that it would be faster than attempt 1. In fact, on a smaller dataset where no exception is thrown, attempt 1 runs many times faster than attempt 2 on my machine. Setting the error aside, I would also like to know how to map an untyped item to a struct.

conclusion

The original question still stands: how can I filter the data based on a criterion without reading the whole file?

pavlexander changed the title from "lazy-loading the data. Filtering the results." to "extracting only a part of the tea file" on Oct 17, 2023

pavlexander commented Oct 18, 2023

attempt 3

I don't get it. If we look at the first candle, the first decimal value is 0.08m (obtained with the typed reader).

If we convert the value to bits with var decimalBits = Decimal.GetBits(0.08M); the result is 8 0 0 131072.

Then I try to read the values manually (with the untyped reader):

using FileStream stream = new FileStream(fileFullPath, FileMode.Open);
using var tf = TeaFile.OpenRead(stream);
using var br = new BinaryReader(stream);

var itemAreaStart = tf.ItemAreaStart;
var openTsColumn = tf.Description.ItemDescription.GetFieldByName("OpenTs");
var openTsColumnOffset = openTsColumn.Offset;
var itemSize = tf.Description.ItemDescription.ItemSize;
var itemsCount = tf.ItemAreaSize / itemSize;
for (int i = 0; i < itemsCount; i++)
{
    var itemOffset = i * itemSize;
    var startAt = itemAreaStart + itemOffset + openTsColumnOffset;
    stream.Seek(startAt, SeekOrigin.Begin);

    var openTs = br.ReadUInt32();
    var decimalVal = br.ReadDecimal(); // exception here

    if (openTs >= from)
    {
        result.Add(default);
    }
}

but I get the same exception as reported previously:

Decimal constructor requires an array or span of four valid decimal bytes

So I dug further and extracted the bytes that represent the first decimal value:

                var openTs = br.ReadUInt32();

                //var decimalVal = br.ReadDecimal(); // exception
                var decimalBytes = br.ReadBytes(16);
                var decimalVal = Read(0, decimalBytes);

where the Read method is:

        public static decimal Read(int startIndex, byte[] buffer)
        {
            Span<int> int32s = stackalloc int[4];
            ReadOnlySpan<byte> bufferSpan = buffer.AsSpan();
            for (int i = 0; i < 4; i++)
            {
                var slice = bufferSpan.Slice(startIndex + i * 4);
                int32s[i] = BitConverter.ToInt32(slice);
            }
            return new decimal(int32s);
        }

Then I get the same kind of exception as before, but I am able to inspect the bits that represent the decimal number:

The bits are: 0 131072 0 8
As a reminder, the correct values are: 8 0 0 131072

So to me it looks like TeaFiles is saving the bytes in some unexpected order, which is why I can't deserialize the value manually. The untyped reader seems broken to me, unless I am doing something wrong, of course.
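
One possible explanation for the two orderings, offered as an assumption rather than a confirmed diagnosis: Decimal.GetBits and BinaryReader.ReadDecimal both use the order lo, mid, hi, flags, whereas the in-memory layout of System.Decimal on current .NET runtimes is flags, hi, lo, mid. A file that stores raw structs would then hold the second order. The observed sequence 0 131072 0 8 would also be consistent with 4 bytes of alignment padding after OpenTs, followed by flags = 131072, hi = 0, lo = 8. The sketch below decodes 16 bytes under that assumed layout; RawDecimal and FromMemoryLayout are hypothetical names and are not part of TeaFiles.

```csharp
using System;
using System.Buffers.Binary;

public static class RawDecimal
{
    // ASSUMPTION: the 16 bytes hold four little-endian int32 values in the
    // order flags, hi, lo, mid (the in-memory layout of System.Decimal on
    // current .NET runtimes, which is an implementation detail).
    public static decimal FromMemoryLayout(ReadOnlySpan<byte> bytes)
    {
        int flags = BinaryPrimitives.ReadInt32LittleEndian(bytes.Slice(0, 4));
        int hi = BinaryPrimitives.ReadInt32LittleEndian(bytes.Slice(4, 4));
        int lo = BinaryPrimitives.ReadInt32LittleEndian(bytes.Slice(8, 4));
        int mid = BinaryPrimitives.ReadInt32LittleEndian(bytes.Slice(12, 4));

        // The decimal(int[]) constructor expects the GetBits order: lo, mid, hi, flags.
        return new decimal(new[] { lo, mid, hi, flags });
    }
}
```

If the padding assumption holds, the 16 bytes should be read from the decimal field's own Offset in the item description (obtained the same way as for OpenTs), not from the position directly after the 4 bytes of OpenTs.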

thulka (Member) commented Oct 18, 2023

To answer your original question "I would like to select/read only a set of data from file based on criteria. Is there an optimal approach for doing it?":

Since the advent of 64-bit machines, the secret among quant analysts working with larger time series has been to store structs in files and memory map them. Soon you run into the problem that you have many files and do not know what kind of structs they hold. TeaFiles solve this problem by adding an (optional) description to the file. Besides that, TeaFiles just store raw structs.

Reading selected fields of such structs does not differ from reading the whole struct when the file is read via memory mapping. Admittedly, all fields of the structs are mapped, so it can be useful to create a derived file that holds only the fields that are read often afterwards.
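
As an illustration of reading a single field through a memory mapping, here is a minimal sketch that uses only System.IO.MemoryMappedFiles and not the TeaFiles API. MappedFieldReader and CountFrom are hypothetical names; the itemAreaStart, itemSize and fieldOffset arguments are assumed to come from values such as tf.ItemAreaStart, ItemSize and the field's Offset, as in the manual-read snippet earlier in this thread.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedFieldReader
{
    // Counts the items whose UInt32 field at fieldOffset is >= from.
    // Only the 4 bytes of that field are read per item; no struct is built.
    public static long CountFrom(
        string path, long itemAreaStart, int itemSize,
        int fieldOffset, long itemCount, uint from)
    {
        using var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
        using var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);

        long count = 0;
        for (long i = 0; i < itemCount; i++)
        {
            uint value = view.ReadUInt32(itemAreaStart + i * itemSize + fieldOffset);
            if (value >= from) count++;
        }
        return count;
    }
}
```

The operating system still pages in whole blocks of the file, so this mainly saves the per-item deserialization cost rather than the I/O.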

Reading selected fields without memory mapping is easy: either read the whole struct and then pick the required fields, or create a reader that skips the non-required fields so that values such as a decimal are not composed from raw bytes unnecessarily. That said, TeaFiles use BinaryReader for that purpose, which is expected to be solid, but maybe (I really do not know at the moment) it can be done faster.
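
Because the file in question is sorted by OpenTs, a reader that skips fields can go one step further and binary-search for the first matching item, so that only about log2(N) keys are read before the range starts. The following is a sketch under stated assumptions, not TeaFiles functionality: SortedItemSearch and LowerBound are hypothetical names, the key is assumed to be a little-endian UInt32, and the layout arguments are assumed to come from the file description (ItemAreaStart, ItemSize, the field's Offset).

```csharp
using System;
using System.IO;
using System.Text;

public static class SortedItemSearch
{
    // Returns the index of the first item whose UInt32 key is >= from,
    // or itemCount if every key is smaller. Requires keys in ascending order.
    public static long LowerBound(
        Stream stream, long itemAreaStart, int itemSize,
        int keyOffset, long itemCount, uint from)
    {
        // leaveOpen: true so the caller can keep using the stream afterwards.
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);

        long lo = 0, hi = itemCount;
        while (lo < hi)
        {
            long mid = lo + (hi - lo) / 2;
            stream.Seek(itemAreaStart + mid * itemSize + keyOffset, SeekOrigin.Begin);
            uint key = reader.ReadUInt32();

            if (key < from) lo = mid + 1; // first match lies to the right
            else hi = mid;                // mid itself may be the first match
        }
        return lo;
    }
}
```

Once the index is known, the stream can be positioned at itemAreaStart + index * itemSize and the remaining items read sequentially from there.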

The problems in your detailed report above are a separate matter; I hope to find the time to dig into them soon.
