Fix for #191 #197

Merged
merged 40 commits into from
Jun 28, 2023
Commits
5398859
fix #191
nck-mlcnv Apr 8, 2023
58d71bb
simple fix for countLines method
nck-mlcnv Apr 8, 2023
23470d2
Merge branch 'develop' into fix/issue-191/readline-with-different-lin…
nck-mlcnv Apr 11, 2023
6798f8f
add IndexedLineReader with tests
nck-mlcnv Apr 12, 2023
6b8add0
change the IndexedLineReader to comply with current indexing implemen…
nck-mlcnv Apr 14, 2023
ae4bae6
update FileUtils
nck-mlcnv Apr 14, 2023
8bec05d
change getLineEnding method in FileUtils class
nck-mlcnv Apr 18, 2023
78ebd18
refactor IndexedLineReader
nck-mlcnv Apr 18, 2023
5848fe2
restructure FileUtilsTest
nck-mlcnv Apr 18, 2023
ebd623b
fix javadoc
nck-mlcnv Apr 18, 2023
343becc
fix javadoc again
nck-mlcnv Apr 21, 2023
5b52ab2
correct try-resource-blocks
nck-mlcnv Apr 21, 2023
b0ab294
add a test for IndexedLineReader
nck-mlcnv Apr 21, 2023
0077efb
correct try-resource-block for streams too
nck-mlcnv Apr 21, 2023
70a172e
make countLines method skip lines that only contain whitespace charac…
nck-mlcnv May 15, 2023
4398bdc
add docs for constructor of FileSeparatorQuerySource
nck-mlcnv May 15, 2023
41cf274
rename IndexedLineReader to IndexedQueryReader
nck-mlcnv May 17, 2023
1d8fbf6
change try-with-resource instructions
nck-mlcnv May 17, 2023
bd76573
refactor indexing
nck-mlcnv May 25, 2023
5ae9166
fix unit tests
nck-mlcnv May 25, 2023
23eb002
fix size method
nck-mlcnv May 25, 2023
7fed3c9
refactor default separator in FileSeparatorQuerySource
nck-mlcnv May 25, 2023
adc47b4
fix more tests
nck-mlcnv May 26, 2023
065e8fe
fix documentation
nck-mlcnv May 26, 2023
1157ce0
update constructor and readQuery
nck-mlcnv May 26, 2023
74cb9ff
remove unused methods and other minor changes
nck-mlcnv Jun 15, 2023
4de0a20
fix indexFile method and add more test cases
nck-mlcnv Jun 17, 2023
b7b6396
Fix/issue 191/rework parsing (#211)
bigerl Jun 21, 2023
8e1a35d
small change to test
nck-mlcnv Jun 23, 2023
1a86dcf
update documentation
nck-mlcnv Jun 23, 2023
5783fd5
fix QueryHandlerTest
nck-mlcnv Jun 23, 2023
90404f9
Merge branch 'develop' into fix/issue-191/readline-with-different-lin…
nck-mlcnv Jun 23, 2023
41002ac
adjust expected test results to new behaviour of the IndexedQueryReader
nck-mlcnv Jun 26, 2023
2191621
add tests for getLineEnding
nck-mlcnv Jun 26, 2023
cb57354
fix test
nck-mlcnv Jun 26, 2023
e82c165
fix test cases
nck-mlcnv Jun 26, 2023
eb88285
Refactor FileUtilsTest to use temporary files
bigerl Jun 28, 2023
18f7d33
Ensure temporary test files are deleted
bigerl Jun 28, 2023
ff7bbc0
Update FileUtilsTest to use temp file
bigerl Jun 28, 2023
b41fbd2
Refactor FileUtilsTest for safer and more dynamic test data
bigerl Jun 28, 2023
@@ -0,0 +1,189 @@
package org.aksw.iguana.cc.utils;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/**
 * Indexes the starting byte positions of the lines in a file, so that a single line can be read without scanning
 * the file from the beginning. <br/>
 * The indexing happens on either:
 * <ul>
 *     <li>the file's line endings, or</li>
 *     <li>a custom line separator</li>
 * </ul>
 * Blank lines are not indexed.
 */
public class IndexedLineReader {

    /** Stores the byte offsets at which the indexed lines start. The offsets are Integers, so only files of up to
     * roughly 2 GB are supported. */
    private ArrayList<Integer> indices;

    private String filepath;

    /** Stores the file size in bytes. */
    private int filesize;

    /** If no line separator is given through the constructor, this string stays empty. */
    private String separator = "";

    /** The number of indexed (non-blank) lines. */
    private int size;

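    /**
     * Indexes the given file based on its line endings.
     * @param filepath the path of the file to index
     * @throws IOException if the file can't be read
     */
    public IndexedLineReader(String filepath) throws IOException {
        this.filepath = filepath;
        this.filesize = (int) Files.size(Paths.get(filepath));
        this.indexFile();
    }

    /**
     * Indexes the given file based on a custom line separator.
     * @param filepath the path of the file to index
     * @param separator the custom line separator
     * @throws IOException if the file can't be read
     */
    public IndexedLineReader(String filepath, String separator) throws IOException {
        this.filepath = filepath;
        this.filesize = (int) Files.size(Paths.get(filepath));
        this.separator = separator;
        this.indexFile(separator);
    }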

    /**
     * Reads the line at the given index by seeking directly to its starting byte, so the data that precedes the line
     * in the file isn't read. <br/>
     * If the lines were indexed based on a custom line separator, this method returns every character between two of
     * the given line separators (the beginning and ending of the file count as line separators too).
     * @param index the index of the line
     * @return the line at the given index
     * @throws IOException if the file can't be read
     */
    public String readLine(int index) throws IOException {
        if (!separator.isEmpty()) {
            // RandomAccessFile.readLine stops at the file's line endings, so it can't be used with a custom line
            // separator; a different implementation is needed
            return readLinesBetweenSeparator(index);
        }

        // try-with-resources closes the file even if seek or readLine throws
        try (RandomAccessFile raf = new RandomAccessFile(this.filepath, "r")) {
            raf.seek(this.indices.get(index));
            return raf.readLine();
        }
    }

    /**
     * Returns a list that contains every indexed line of the file.
     * @return list of lines
     * @throws IOException if the file can't be read
     */
    public List<String> readLines() throws IOException {
        ArrayList<String> out = new ArrayList<>();
        for (int i = 0; i < indices.size(); i++) {
            out.add(this.readLine(i));
        }
        return out;
    }

    /**
     * Returns the number of non-blank lines the file contains.
     * @return number of non-blank lines
     */
    public int size() {
        return this.size;
    }

    /**
     * Reads the bytes between the given index and the next index (or the end of the file), decodes them as UTF-8
     * and cuts off the line separator along with every character after it.
     * @param index the index of the line
     * @return the string between two line separators
     * @throws IOException if the file can't be read
     */
    private String readLinesBetweenSeparator(int index) throws IOException {
        byte[] data;
        if ((this.indices.size() - 1) == index) {
            data = new byte[this.filesize - this.indices.get(index)];
        } else {
            data = new byte[this.indices.get(index + 1) - this.indices.get(index)];
        }

        // try-with-resources closes the file even if reading fails
        try (RandomAccessFile raf = new RandomAccessFile(this.filepath, "r")) {
            raf.seek(this.indices.get(index));
            // readFully fills the whole buffer, whereas read may return after fewer bytes
            raf.readFully(data);
        }

        String output = new String(data, StandardCharsets.UTF_8);
        int separatorIndex = output.indexOf(this.separator);
        if (separatorIndex != -1) {
            // Remove the separator and every character after it from the string
            output = output.substring(0, separatorIndex);
        }
        return output;
    }

    /** Indexes the lines based on the file's own line endings. This method ignores blank lines. */
    private void indexFile() throws IOException {
        this.indices = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(this.filepath, StandardCharsets.UTF_8))) {
            // The method needs to know the length of the line ending used in the file to be able to properly calculate
            // the starting byte position of a line
            int lineEndingLength = FileUtils.getLineEnding(this.filepath).length();
            int index = 0;
            String line;
            while ((line = br.readLine()) != null) {
                if (!line.isBlank()) {
                    this.indices.add(index);
                    this.size++;
                }
                // Advance by the encoded byte length: line.length() counts chars and would drift from the byte
                // offsets that RandomAccessFile.seek expects once a line contains multi-byte UTF-8 characters
                index += line.getBytes(StandardCharsets.UTF_8).length + lineEndingLength;
            }
        }
    }

    /**
     * Indexes the lines based on a custom line separator. If the content between two line separators is blank, this
     * method won't index that line.
     * @param separator the custom line separator
     * @throws IOException if the file can't be read
     */
    private void indexFile(String separator) throws IOException {
        this.indices = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(this.filepath, StandardCharsets.UTF_8))) {
            // The method needs to know the length of the line ending used in the file to be able to properly calculate
            // the starting byte position of a line
            int lineEndingLength = FileUtils.getLineEnding(this.filepath).length();
            int currentIndex = 0;

            // The byte offset at which the content after the most recent separator starts
            int lastIndex = 0;

            // Used to check if every line between two separators is blank
            boolean blank = true;

            String line;
            while ((line = br.readLine()) != null) {
                // Advance by the encoded byte length: line.length() counts chars and would drift from the byte
                // offsets that RandomAccessFile.seek expects once a line contains multi-byte UTF-8 characters
                int lineLength = line.getBytes(StandardCharsets.UTF_8).length + lineEndingLength;

                if (line.isBlank()) {
                    currentIndex += lineLength;
                    continue;
                }

                if (line.contains(separator)) {
                    if (!blank) {
                        this.indices.add(lastIndex);
                        this.size++;
                    }
                    currentIndex += lineLength;
                    lastIndex = currentIndex;
                    blank = true;
                    continue;
                }

                blank = false;
                currentIndex += lineLength;
            }
            if (!blank) {
                this.indices.add(lastIndex);
                this.size++;
            }
        }
    }
}
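
Review note: the index in this class has to hold byte offsets, because `RandomAccessFile.seek` works on bytes, while `BufferedReader.readLine` hands back decoded characters. The sketch below is one possible way to make that explicit by scanning the raw bytes directly. It is not part of this PR: the class `ByteOffsetIndexSketch` and its methods are hypothetical, it works on an in-memory `byte[]` instead of a file, it assumes UTF-8 content, and it only recognises `\n` and `\r\n` as line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of this PR): indexes line starts by scanning raw bytes. */
final class ByteOffsetIndexSketch {

    /** Returns the byte offset of the first byte of every non-blank line. */
    static List<Integer> indexLineStarts(byte[] data) {
        List<Integer> starts = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = i == data.length;
            if (atEnd || data[i] == '\n') {
                // Exclude the '\r' of a "\r\n" line ending from the line content
                int lineEnd = (i > lineStart && data[i - 1] == '\r') ? i - 1 : i;
                String line = new String(data, lineStart, lineEnd - lineStart, StandardCharsets.UTF_8);
                if (!line.isBlank()) {
                    starts.add(lineStart);
                }
                lineStart = i + 1;
            }
        }
        return starts;
    }

    /** Decodes the line that starts at the given byte offset, without its line ending. */
    static String lineAt(byte[] data, int start) {
        int end = start;
        while (end < data.length && data[end] != '\n' && data[end] != '\r') {
            end++;
        }
        return new String(data, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00e4" takes two bytes in UTF-8, so char counts and byte offsets differ here
        byte[] data = "\u00e4 first\r\n\r\nsecond\n".getBytes(StandardCharsets.UTF_8);
        for (int start : indexLineStarts(data)) {
            System.out.println(start + ": " + lineAt(data, start));
        }
    }
}
```

Because the offsets come from the bytes themselves, the length of the line ending never has to be detected up front, and files with mixed `\n` and `\r\n` endings are handled too. The trade-off in this sketch is that the whole content is held in memory; a streaming variant would be needed for large query files.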
@@ -0,0 +1,43 @@
package org.aksw.iguana.cc.utils;

import org.junit.Test;

import java.io.IOException;

import static org.junit.Assert.assertEquals;

public class IndexedLineReaderTest {

    @Test
    public void testIndexingWithLineEndings() throws IOException {
        IndexedLineReader reader1 = new IndexedLineReader("src/test/resources/readLineTestFile1.txt");
        IndexedLineReader reader2 = new IndexedLineReader("src/test/resources/readLineTestFile2.txt");
        IndexedLineReader reader3 = new IndexedLineReader("src/test/resources/readLineTestFile3.txt");

        assertEquals("line 1", reader1.readLine(0));
        assertEquals("line 1", reader2.readLine(0));
        assertEquals("line 1", reader3.readLine(0));
        assertEquals("line 2", reader1.readLine(1));
        assertEquals("line 2", reader2.readLine(1));
        assertEquals("line 2", reader3.readLine(1));
        assertEquals("line 3", reader1.readLine(2));
        assertEquals("line 3", reader2.readLine(2));
        assertEquals("line 3", reader3.readLine(2));
        assertEquals("line 4", reader1.readLine(3));
        assertEquals("line 4", reader2.readLine(3));
        assertEquals("line 4", reader3.readLine(3));
    }

    @Test
    public void testIndexingWithCustomSeparator() throws IOException {
        IndexedLineReader reader1 = new IndexedLineReader("src/test/resources/utils/indexingtestfile1.txt", "#####");

        assertEquals("line 1\r\n", reader1.readLine(0));
        assertEquals("\r\nline 2\r\n", reader1.readLine(1));

        IndexedLineReader reader2 = new IndexedLineReader("src/test/resources/utils/indexingtestfile2.txt", "#####");
        assertEquals("\r\nline 0\r\n", reader2.readLine(0));
        assertEquals("line 1\r\n", reader2.readLine(1));
        assertEquals("\r\nline 2\r\n", reader2.readLine(2));
    }
}
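
Review note: `testIndexingWithLineEndings` expects the same four lines from three fixture files, which presumably differ only in their line ending (the fixtures themselves are not shown in this diff). The indexing relies on `FileUtils.getLineEnding`, whose implementation is also not shown here. As a rough illustration of the idea only, a line ending could be detected by looking at the first terminator in the content. `LineEndingSketch` and `detectLineEnding` are hypothetical names, and the fallback to `System.lineSeparator()` is an assumption, not the behaviour of `FileUtils`.

```java
/** Hypothetical helper (not the FileUtils implementation): detects the first line ending in a string. */
final class LineEndingSketch {

    /**
     * Returns "\r\n", "\n" or "\r", depending on the first line terminator found.
     * Falls back to the platform line separator if the content has no terminator (an assumption of this sketch).
     */
    static String detectLineEnding(String content) {
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                // A '\r' directly followed by '\n' is a Windows line ending
                boolean crlf = i + 1 < content.length() && content.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return System.lineSeparator();
    }

    public static void main(String[] args) {
        // Prints the length of the detected line ending for a CRLF-terminated sample
        System.out.println(detectLineEnding("line 1\r\nline 2\r\n").length());
    }
}
```

A detector like this only inspects the first terminator, so it shares a limitation with any single-line-ending approach: a file that mixes line endings would still produce wrong offsets.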
@@ -0,0 +1,14 @@

#####
line 1
#####
#####

#####
#####
#####
#####

#####

line 2
@@ -0,0 +1,9 @@

line 0
#####

#####
line 1
#####

line 2
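
Review note: the two fixtures above exercise the rule that content between two `#####` separators is only indexed when it is not blank. A simplified in-memory model of that rule might look like the sketch below. `SeparatorBlocksSketch` is a hypothetical name and not part of this PR. Unlike `IndexedLineReader`, which returns the raw text including its `\r\n` line endings, this sketch strips surrounding whitespace from each block to keep the comparison simple.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of this PR): splits content into non-blank blocks between separator lines. */
final class SeparatorBlocksSketch {

    /** Returns the stripped, non-blank blocks found between lines that contain the separator. */
    static List<String> nonBlankBlocks(String content, String separator) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // "\\R" matches any line break, including "\r\n" as a single unit
        for (String line : content.split("\\R", -1)) {
            if (line.contains(separator)) {
                flush(blocks, current);
            } else {
                current.append(line).append('\n');
            }
        }
        // The end of the content counts as a separator too
        flush(blocks, current);
        return blocks;
    }

    /** Stores the collected block if it is not blank and resets the buffer. */
    private static void flush(List<String> blocks, StringBuilder current) {
        String block = current.toString().strip();
        if (!block.isEmpty()) {
            blocks.add(block);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // Content modelled on the second fixture, assuming "\r\n" line endings
        String fixture = "\r\nline 0\r\n#####\r\n\r\n#####\r\nline 1\r\n#####\r\n\r\nline 2\r\n";
        System.out.println(nonBlankBlocks(fixture, "#####"));
    }
}
```

Run against content shaped like the first fixture, the runs of consecutive separators and the blank blocks between them contribute nothing, which matches the two entries the test expects from `indexingtestfile1.txt`.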