Conversation

@davisusanibar davisusanibar commented Jul 15, 2023

Rationale for this change

To close apache/arrow-java#181

What changes are included in this PR?

Enable HDFS by default on Java Dataset module
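For context, a sketch of how HDFS support is toggled when building the Arrow C++ libraries that back the Java Dataset JNI module. `ARROW_HDFS`, `ARROW_DATASET`, and `ARROW_PARQUET` are Arrow C++ CMake options; the exact build-script wiring changed by this PR is not shown in this thread, so treat this as an illustrative invocation only:

```shell
# Sketch: build the Arrow C++ libraries used by the Java Dataset JNI
# bindings with the HDFS filesystem adapter enabled.
cmake -S cpp -B cpp/build \
      -DARROW_DATASET=ON \
      -DARROW_PARQUET=ON \
      -DARROW_HDFS=ON
cmake --build cpp/build
```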

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@github-actions

⚠️ GitHub issue apache/arrow-java#181 has been automatically assigned in GitHub to PR creator.

@danepitkin

Do we still want to do this?

@davisusanibar

> Do we still want to do this?

Yes. I am able to read HDFS Parquet files, but for some reason the program will not shut down afterwards.

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class ReadHdfsParquet {
    public static void main(String[] args) {
        // Requires the environment variable HADOOP_HOME to be set, e.g.
        // HADOOP_HOME=/Users/dsusanibar/hadoop-3.3.2, so the native library
        // can be found under lib/native/libhdfs.dylib.
        String uri = "hdfs://localhost:9000/Users/dsusanibar/data4_2rg_gzip.parquet";
        ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
            Dataset dataset = datasetFactory.finish();
            Scanner scanner = dataset.newScan(options);
            ArrowReader reader = scanner.scanBatches()
        ) {
            Schema schema = scanner.schema();
            System.out.println(schema);
            while (reader.loadNextBatch()) {
                System.out.println(reader.getVectorSchemaRoot().contentToTSVString());
                System.out.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
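One way to narrow down why the JVM stays alive after `main` returns is to dump the non-daemon threads at the end of the program: the JVM only exits once all non-daemon Java threads have finished. This is a generic diagnostic sketch (the class name `NonDaemonThreads` is mine, not part of Arrow), and note that native threads created by JNI code can also block shutdown without appearing in this list:

```java
// Diagnostic sketch: print every live non-daemon Java thread. Any thread
// other than "main" listed here can keep the JVM from exiting.
public class NonDaemonThreads {
    public static void main(String[] args) {
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (!t.isDaemon()) {
                System.out.println("non-daemon thread: " + t.getName());
            }
        }
    }
}
```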


@danepitkin Could you check whether you can reproduce this problem on your side?

@danepitkin

I can take a look!


zinking commented Aug 23, 2023

#37323

@davisusanibar could this be pushed forward?

@davisusanibar

This pull request has been closed in order to define a better alternative solution.

@github-actions

⚠️ GitHub issue #36703 has no components, please add labels for components.
