Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add std.string.splitterLines() #2987

Merged
merged 1 commit into from
Feb 15, 2015

Conversation

WalterBright
Copy link
Member

Melds together string.splitLines() and algorithm.splitter() to make a line splitter that doesn't allocate memory and accepts ranges as input instead of just arrays.

@WalterBright WalterBright force-pushed the string-splitterLines branch 2 times, most recently from 83f16d1 to 003ec29 Compare February 15, 2015 03:40

@property Range front()
{
if (iStart == _unComputed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of superlong if statements - this is virtually the entire function. Have it call a helper function? Or make an early return?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of branch prediction, the 'then' is the hot case, and this is why it is organized as it is. Performance! The 'then' is complex enough that as a helper function, it would not get inlined.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You won't be able to measure a difference for this one test in such a large function.

Functions consisting of essentially one if are the pits. The reader sees the beginning of a bulky if and goes, "boy that's some good long if statement, surely this function will do many wonderful interesting things long after this if is done". So the reader puts the if on the mental stack and goes on to read the large body of code inside the if, to just realize much later that all that was for naught - there's nothing after if so it needn't be remembered in the first place!

This is not good style, pure and simple. It's not reason enough to reject this code, but you'd do good to refactor it and anyone coming later would do good to refactor it.

@burner
Copy link
Member

burner commented Feb 15, 2015

can you transform the nogc test into a doc unittest. Examples always help.

Does not throw on invalid UTF; such is simply passed unchanged
to the output.

Does not allocate memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a native english speaker, but I believe both "Does .." sentences need a subject.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually a common usage when describing how something works. The subject is implied -- though we don't use it in conversation.

@Panke
Copy link
Contributor

Panke commented Feb 15, 2015

LGTM.

{
auto s = "\rpeter\n\rpaul\r\njerry\u2028ice\u2029cream\n\nsunday\nmon\u2030day\n";
auto lines = s.splitterLines();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An iteration would be nice here - after that we can pull this.

static immutable witness = ["peter", "paul", "jerry", "\u2028ice\u2029cream", "sunday", "mon\u2030day"];
uint i;
foreach (line; lines)
{
    assert(line == witness[i++]);
} 
assert(i == 6);

@WalterBright WalterBright force-pushed the string-splitterLines branch 2 times, most recently from 83967e8 to 18336ad Compare February 15, 2015 22:12
@andralex
Copy link
Member

Auto-merge toggled on

andralex added a commit that referenced this pull request Feb 15, 2015
@andralex andralex merged commit 59250ec into dlang:master Feb 15, 2015
@WalterBright WalterBright deleted the string-splitterLines branch February 16, 2015 00:03
@schuetzm
Copy link
Contributor

I see it's already merged, but I'd like to bikeshed: splitterLines sounds awkward. How about linesSplitter or lineSplitter?

@Poita
Copy link
Contributor

Poita commented Feb 16, 2015

Have to agree, the name really isn't great. Can we overload byLine? It shouldn't conflict, right?

@Hackerpilot
Copy link
Member

I also think that lineSplitter would be more clear. Walter even called it "line splitter" in his original comment.

@quickfur
Copy link
Member

👍 for lineSplitter. splitterLines sounds really poor.

@JakobOvrum
Copy link
Member

Requiring length seriously limits this algorithm. It can't support infinite ranges, nor ranges with unknown length, such as data coming from a socket.

@kuettler
Copy link
Contributor

It requires slicing. Are there ranges without length that support slicing? Without allocating memory? There is splitLines and std.algorithm.splitter for such purposes, right?

@kuettler
Copy link
Contributor

Also, given the functions splitLines and std.algorithm.splitter, the name splitterLines might not be so bad. I like to think of range functions as functions rather than object. lineSplitter is not a nice name for a function.

@schuetzm
Copy link
Contributor

@kuettler Infinite ranges can support slicing, but they don't have a (finite) length.

@ntrel
Copy link
Contributor

ntrel commented Feb 17, 2015

@kuettler We already have splitter which is a function, it acts like a constructor for a type. Assuming that name is OK, lineSplitter seems logical for this pull.

@Poita:

Can we overload byLine? It shouldn't conflict, right?

Maybe, but std.stdio.byLine doesn't handle all kinds of newline separators (you have to pick only one), so IMO it would be confusing to have the same name for things that have different semantics.

{
private:
Range _input;
alias IndexType = typeof(unsigned(_input.length));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WalterBright what is the reason for this alias? is size_t not suitable here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's murky territory. For example, the length property is allowed to always return ulong (according to hasLength), which breaks code that assumes size_t when the target is a 32-bit platform. Yet many range algorithms and higher-order ranges treat the length property as being size_t. I don't think it's a major problem, but in the long term we need to settle on which way to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.