Skip to content

Conversation

@moos
Copy link

@moos moos commented May 12, 2012

Hi Gabriel,

Thanks for this nice package. I needed a way to get the byte offset of the line I was reading, so I wend ahead and added the byte offset to the callback for the 'byte', 'character', 'buffer', and 'line' events. 'line' event also gets the line no.

Sorry for the white-space mix up, I have it set to no-tabs. If you do a 'git diff --ignore-all-space' you'll see just the changes.

Also I did a bit of optimization on the event callbacks. Don't know if it makes much difference, but better to be on the safe side. (I'm reading multi MB files).

Should probably add test cases if you decide to incorporate these.

Best,
moos

Here's the diff less white-space clutter:

diff --git a/src/buffered-reader.js b/src/buffered-reader.js
index c7352bb..42c86c8 100644
--- a/src/buffered-reader.js
+++ b/src/buffered-reader.js
@@ -70,9 +70,14 @@ BufferedReader.prototype.read = function (){
   var lastChunk;
   var buffer;
   var me = this;
+  var lineOffset = 0,
+    lineCount = 0,
+    byteOffset = 0;

-   var loop = this.listeners ("character").length !== 0 || this.listeners ("line").length !== 0 ||
-       this.listeners ("byte").length !== 0;
+  var onChar = this.listeners ("character").length !== 0,
+    onLine = this.listeners ("line").length !== 0,
+    onByte = this.listeners ("byte").length !== 0,
+    loop = onChar || onLine || onByte;

   stream.on ("data", function (data){
     buffer = data;
@@ -87,26 +92,30 @@ BufferedReader.prototype.read = function (){

         character = data[i];
         if (stream.encoding){
-                   me.emit ("character", character === "\r" ? "\n" : character);
+          onChar && me.emit ("character", character === "\r" ? "\n" : character, byteOffset + i);
         }else{
-                   me.emit ("byte", character);
+          onByte && me.emit ("byte", character, byteOffset + i);
           continue;
         }

+        if (!onLine) continue;
         if (character === "\n" || character === "\r"){
           chunk = data.slice (offset, i);
-                   offset = i + 1;

           if (lastChunk){
             chunk = lastChunk.concat (chunk);
-                       lastChunk = null;
           }

           if (i + 1 !== len && character === "\r" && data[i + 1] === "\n"){
             i++;
           }

-                   me.emit ("line", chunk);
+          me.emit ("line", chunk, lineOffset + offset, ++lineCount);
+          offset = i + 1;
+          if (lastChunk){
+              lineOffset += lastChunk.length;
+              lastChunk = null;
+            }
         }
       }

@@ -114,14 +123,17 @@ BufferedReader.prototype.read = function (){
         var s = offset === 0 ? data : data.slice (offset);
         lastChunk = lastChunk ? lastChunk.concat (s) : s;
       }
+      lineOffset += offset;
     }

-       me.emit ("buffer", data);
+    me.emit ("buffer", data, byteOffset);
     if (me._interrupted){
       me._interrupted = false;
       stream.destroy ();
       me.emit ("end");
     }
+    byteOffset += len;
+
   });

   stream.on ("end", function (){

@ghost ghost assigned gagle May 12, 2012
@gagle
Copy link
Owner

gagle commented May 12, 2012

Hi moos,

Thanks for your interest. I like your improvement in the callbacks but I disagree with the line counter because this can be easily done from outside; it's not necessary to pass it as a callback parameter. Tomorrow I'll take a look at the rest of the code.

@gagle
Copy link
Owner

gagle commented May 13, 2012

I've reviewed the code and I've added only the callbacks improvement. The offsets and the line counter can be handled from outside the class with a simple counter. New version is 0.2.1.

Also, take into account that 1 byte is not always 1 char because a utf8 character can be encoded in one or more bytes. You were incrementing the byte offset counter by 1 when the stream had an encoding.

@gagle gagle closed this May 13, 2012
@moos
Copy link
Author

moos commented May 13, 2012

I don't think the offset is trivial, at least for the 'line' event. I suppose they could attach a 'character' or 'byte' event and count #bytes then subtract the current line length to get the line offset, but that's too inefficient and not quite elegant, imho.

You're right about the byte-count for utf8. I knew it needed more testing. :)

@gagle
Copy link
Owner

gagle commented May 13, 2012

I've tought a little bit more and you're right. I'll add the offset in all the events. I'm testing how to do that for utf8 encoded chars.

EDIT:
I think Buffer.byteLength() is bugged. Please see: http://stackoverflow.com/questions/10572232/buffer-bytelength-is-bugged

For testing, you can download the project and execute the test/playground.js file. If you modify the test/a file and store X (go to Unicode page (http://en.wikibooks.org/wiki/Unicode/Character_reference/24000-24FFF) and copy any character because I can't copy here a unicode above 0xFFFF) you'll see that the byteOffset parameter it's not correct.

@gagle gagle reopened this May 13, 2012
@gagle gagle closed this May 13, 2012
@gagle
Copy link
Owner

gagle commented May 13, 2012

v0.2.2 will have the offset parameter

@moos
Copy link
Author

moos commented May 20, 2012

I finally got a chance to test this and noticed your changes behave differently than mine. Your byte offset returns the post-offset, ie, the offset at the end of the entity just read. Mine is the offset at the beginning of the entity.

So 'character' event should return a 0 offset on the first character of the file, 1 on the second, etc. Similarly first 'line' event should return 0 offset (ie, beginning of the line). I think this is more useful and customary.

@gagle
Copy link
Owner

gagle commented May 20, 2012

I return the next byte offset after the reading is done to ease the reading of a file. The offset will point to the next byte to read. If you want to know the previous offset it's easy, just create a variable an initialize it to 0 and increment it.

byteOffset example:

var BufferedReader = require ("../build/buffered-reader");

console.log ("First line, expected offset: 21");
console.log ("Second line, expected offset: 61\n");

var o = 0;
new BufferedReader ("bmp", { encoding: "utf8" })
    .on ("error", function (error){
        console.log (error);
    })
    .on ("line", function (line, byteOffset){
        console.log ("start: " + o + ", end: " + (byteOffset - 1));
        o += byteOffset;
    })
    .read ();

Result:

First line, expected offset: 21
Second line, expected offset: 61

start: 0, end: 20
start: 21, end: 60

@moos
Copy link
Author

moos commented May 20, 2012

Thanks for the suggestion. For me getting the offset of the current event is important, but others may have different use case.

I might suggest renaming it to nextOffset or endOffset or at least clarifying in the docs as I think it'll confuse others as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants