Conversation

@davisusanibar davisusanibar commented Jul 15, 2023

Rationale for this change

To close apache/arrow-java#181

What changes are included in this PR?

Enable HDFS by default on Java Dataset module
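For context, a sketch of how HDFS support is toggled when building the Arrow C++ libraries that back the Java Dataset JNI module. `ARROW_HDFS`, `ARROW_DATASET`, and `ARROW_PARQUET` are Arrow C++ CMake options; the exact build-script wiring changed by this PR is not shown in this thread, so treat this as an illustrative invocation only:

```shell
# Sketch: build the Arrow C++ libraries used by the Java Dataset JNI
# bindings with the HDFS filesystem adapter enabled.
cmake -S cpp -B cpp/build \
      -DARROW_DATASET=ON \
      -DARROW_PARQUET=ON \
      -DARROW_HDFS=ON
cmake --build cpp/build
```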

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@github-actions

⚠️ GitHub issue apache/arrow-java#181 has been automatically assigned in GitHub to PR creator.

@danepitkin

Do we still want to do this?

@davisusanibar

> Do we still want to do this?

Yes. I am able to read HDFS Parquet files, but for some reason the program will not shut down afterwards.

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class ReadHdfsParquet {
    public static void main(String[] args) {
        // Requires the environment variable HADOOP_HOME to be set, e.g.
        // HADOOP_HOME=/Users/dsusanibar/hadoop-3.3.2, so the native library
        // can be found under lib/native/libhdfs.dylib.
        String uri = "hdfs://localhost:9000/Users/dsusanibar/data4_2rg_gzip.parquet";
        ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
            Dataset dataset = datasetFactory.finish();
            Scanner scanner = dataset.newScan(options);
            ArrowReader reader = scanner.scanBatches()
        ) {
            Schema schema = scanner.schema();
            System.out.println(schema);
            while (reader.loadNextBatch()) {
                System.out.println(reader.getVectorSchemaRoot().contentToTSVString());
                System.out.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
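One way to narrow down why the JVM stays alive after `main` returns is to dump the non-daemon threads at the end of the program: the JVM only exits once all non-daemon Java threads have finished. This is a generic diagnostic sketch (the class name `NonDaemonThreads` is mine, not part of Arrow), and note that native threads created by JNI code can also block shutdown without appearing in this list:

```java
// Diagnostic sketch: print every live non-daemon Java thread. Any thread
// other than "main" listed here can keep the JVM from exiting.
public class NonDaemonThreads {
    public static void main(String[] args) {
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (!t.isDaemon()) {
                System.out.println("non-daemon thread: " + t.getName());
            }
        }
    }
}
```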


@danepitkin Could you check whether you can reproduce this problem on your side?

@danepitkin

I can take a look!


zinking commented Aug 23, 2023

#37323

@davisusanibar could this be pushed forward?

@davisusanibar

This pull request has been closed in order to define a better alternative solution.

@github-actions

⚠️ GitHub issue #36703 has no components, please add labels for components.
