Adding byteoffset and line no. to callbacks. #1

moos · 2012-05-12T21:39:56Z

Hi Gabriel,

Thanks for this nice package. I needed a way to get the byte offset of the line I was reading, so I wend ahead and added the byte offset to the callback for the 'byte', 'character', 'buffer', and 'line' events. 'line' event also gets the line no.

Sorry for the white-space mix up, I have it set to no-tabs. If you do a 'git diff --ignore-all-space' you'll see just the changes.

Also I did a bit of optimization on the event callbacks. Don't know if it makes much difference, but better to be on the safe side. (I'm reading multi MB files).

Should probably add test cases if you decide to incorporate these.

Best,
moos

Here's the diff less white-space clutter:

diff --git a/src/buffered-reader.js b/src/buffered-reader.js
index c7352bb..42c86c8 100644
--- a/src/buffered-reader.js
+++ b/src/buffered-reader.js
@@ -70,9 +70,14 @@ BufferedReader.prototype.read = function (){
   var lastChunk;
   var buffer;
   var me = this;
+  var lineOffset = 0,
+    lineCount = 0,
+    byteOffset = 0;

-   var loop = this.listeners ("character").length !== 0 || this.listeners ("line").length !== 0 ||
-       this.listeners ("byte").length !== 0;
+  var onChar = this.listeners ("character").length !== 0,
+    onLine = this.listeners ("line").length !== 0,
+    onByte = this.listeners ("byte").length !== 0,
+    loop = onChar || onLine || onByte;

   stream.on ("data", function (data){
     buffer = data;
@@ -87,26 +92,30 @@ BufferedReader.prototype.read = function (){

         character = data[i];
         if (stream.encoding){
-                   me.emit ("character", character === "\r" ? "\n" : character);
+          onChar && me.emit ("character", character === "\r" ? "\n" : character, byteOffset + i);
         }else{
-                   me.emit ("byte", character);
+          onByte && me.emit ("byte", character, byteOffset + i);
           continue;
         }

+        if (!onLine) continue;
         if (character === "\n" || character === "\r"){
           chunk = data.slice (offset, i);
-                   offset = i + 1;

           if (lastChunk){
             chunk = lastChunk.concat (chunk);
-                       lastChunk = null;
           }

           if (i + 1 !== len && character === "\r" && data[i + 1] === "\n"){
             i++;
           }

-                   me.emit ("line", chunk);
+          me.emit ("line", chunk, lineOffset + offset, ++lineCount);
+          offset = i + 1;
+          if (lastChunk){
+              lineOffset += lastChunk.length;
+              lastChunk = null;
+            }
         }
       }

@@ -114,14 +123,17 @@ BufferedReader.prototype.read = function (){
         var s = offset === 0 ? data : data.slice (offset);
         lastChunk = lastChunk ? lastChunk.concat (s) : s;
       }
+      lineOffset += offset;
     }

-       me.emit ("buffer", data);
+    me.emit ("buffer", data, byteOffset);
     if (me._interrupted){
       me._interrupted = false;
       stream.destroy ();
       me.emit ("end");
     }
+    byteOffset += len;
+
   });

   stream.on ("end", function (){

gagle · 2012-05-12T22:06:56Z

Hi moos,

Thanks for your interest. I like your improvement in the callbacks but I disagree with the line counter because this can be easily done from outside; it's not necessary to pass it as a callback parameter. Tomorrow I'll take a look at the rest of the code.

gagle · 2012-05-13T09:27:50Z

I've reviewed the code and I've added only the callbacks improvement. The offsets and the line counter can be handled from outside the class with a simple counter. New version is 0.2.1.

Also, take into account that 1 byte is not always 1 char because a utf8 character can be encoded in one or more bytes. You were incrementing the byte offset counter by 1 when the stream had an encoding.

moos · 2012-05-13T09:51:00Z

I don't think the offset is trivial, at least for the 'line' event. I suppose they could attach a 'character' or 'byte' event and count #bytes then subtract the current line length to get the line offset, but that's too inefficient and not quite elegant, imho.

You're right about the byte-count for utf8. I knew it needed more testing. :)

gagle · 2012-05-13T13:12:37Z

I've tought a little bit more and you're right. I'll add the offset in all the events. I'm testing how to do that for utf8 encoded chars.

EDIT:
I think Buffer.byteLength() is bugged. Please see: http://stackoverflow.com/questions/10572232/buffer-bytelength-is-bugged

For testing, you can download the project and execute the test/playground.js file. If you modify the test/a file and store X (go to Unicode page (http://en.wikibooks.org/wiki/Unicode/Character_reference/24000-24FFF) and copy any character because I can't copy here a unicode above 0xFFFF) you'll see that the byteOffset parameter it's not correct.

gagle · 2012-05-13T17:07:15Z

v0.2.2 will have the offset parameter

moos · 2012-05-20T14:50:26Z

I finally got a chance to test this and noticed your changes behave differently than mine. Your byte offset returns the post-offset, ie, the offset at the end of the entity just read. Mine is the offset at the beginning of the entity.

So 'character' event should return a 0 offset on the first character of the file, 1 on the second, etc. Similarly first 'line' event should return 0 offset (ie, beginning of the line). I think this is more useful and customary.

gagle · 2012-05-20T15:14:52Z

I return the next byte offset after the reading is done to ease the reading of a file. The offset will point to the next byte to read. If you want to know the previous offset it's easy, just create a variable an initialize it to 0 and increment it.

byteOffset example:

var BufferedReader = require ("../build/buffered-reader");

console.log ("First line, expected offset: 21");
console.log ("Second line, expected offset: 61\n");

var o = 0;
new BufferedReader ("bmp", { encoding: "utf8" })
    .on ("error", function (error){
        console.log (error);
    })
    .on ("line", function (line, byteOffset){
        console.log ("start: " + o + ", end: " + (byteOffset - 1));
        o += byteOffset;
    })
    .read ();

Result:

First line, expected offset: 21
Second line, expected offset: 61

start: 0, end: 20
start: 21, end: 60

moos · 2012-05-20T15:31:03Z

Thanks for the suggestion. For me getting the offset of the current event is important, but others may have different use case.

I might suggest renaming it to nextOffset or endOffset or at least clarifying in the docs as I think it'll confuse others as well.

added byte offset and line no. to callbacks, optimize events

fb1d281

ghost assigned gagle May 12, 2012

gagle closed this May 13, 2012

gagle reopened this May 13, 2012

gagle closed this May 13, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding byteoffset and line no. to callbacks. #1

Adding byteoffset and line no. to callbacks. #1

Uh oh!

moos commented May 12, 2012

Uh oh!

gagle commented May 12, 2012

Uh oh!

gagle commented May 13, 2012

Uh oh!

moos commented May 13, 2012

Uh oh!

gagle commented May 13, 2012

Uh oh!

gagle commented May 13, 2012

Uh oh!

moos commented May 20, 2012

Uh oh!

gagle commented May 20, 2012

Uh oh!

moos commented May 20, 2012

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Adding byteoffset and line no. to callbacks. #1

Adding byteoffset and line no. to callbacks. #1

Uh oh!

Conversation

moos commented May 12, 2012

Uh oh!

gagle commented May 12, 2012

Uh oh!

gagle commented May 13, 2012

Uh oh!

moos commented May 13, 2012

Uh oh!

gagle commented May 13, 2012

Uh oh!

gagle commented May 13, 2012

Uh oh!

moos commented May 20, 2012

Uh oh!

gagle commented May 20, 2012

Uh oh!

moos commented May 20, 2012

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants