
I don't know why it takes so long for this case. #911

Closed · loia5tqd001 opened this issue Jan 12, 2022 · 9 comments

Comments

@loia5tqd001

Papaparse can parse a 50MB file in about 1 second on my machine. It's great.

However, I'm not sure what is special about this CSV file: it's just 400KB, yet it takes 60 seconds to parse, and the whole browser stalls in the meantime.

I've tried tweaking some settings, and it looks like the problem only occurs if:

  • We don't enable the Stream option
  • We enable the Skip Empty Lines option

Changing either of those makes it work normally again. I'm just wondering why these options affect performance so significantly.

https://www.papaparse.com/demo
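For reference, the parse call that hangs looks roughly like this (a sketch only; it assumes the file comes from a standard file input, and the demo's Stream checkbox corresponds to supplying a step callback):

	// Sketch of the failing configuration; `file` is assumed to be the
	// 400KB CSV picked via an <input type="file"> element.
	Papa.parse(file, {
		skipEmptyLines: true, // turning this on triggers the ~60s stall
		// step: function(row) { ... }, // enabling streaming avoids it
		complete: function(results) {
			console.log('parsed', results.data.length, 'rows');
		}
	});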


evil_csv_file.csv

@bezrodnov
Contributor

Empty lines are skipped using data.splice(rowNumber, 1), which is likely the root cause: removing elements one by one in a loop over an array with a few thousand or million records takes a long time (and probably a lot of memory churn too).
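A quick way to see the difference in a browser console (an illustration with synthetic rows, not PapaParse's own code; the empty-line check here is simplified):

	// Build synthetic data: every other row is "empty" (a single empty field).
	var rows = [];
	for (var i = 0; i < 100000; i++)
		rows.push(i % 2 ? [''] : ['a', 'b']);

	console.time('splice loop');
	var spliced = rows.slice();
	for (var i = 0; i < spliced.length; i++)
		if (spliced[i].length === 1 && spliced[i][0] === '')
			spliced.splice(i--, 1); // each splice shifts the rest of the array
	console.timeEnd('splice loop'); // grows roughly quadratically with row count

	console.time('filter');
	var filtered = rows.filter(function(r) {
		return !(r.length === 1 && r[0] === '');
	});
	console.timeEnd('filter'); // single linear pass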

@loia5tqd001
Author

Thanks for your response @bezrodnov. Do you think this can be resolved differently, without data.splice(rowNumber, 1)? Processing a file with that many empty lines is a weird use case anyway, since it only affects that user's own browser.

@bezrodnov
Contributor

Hi @loia5tqd001 :)
I can think of a few solutions with better performance, and I've already tried them locally after cloning the repo and adding a few test cases that reproduce this issue.

I'll probably open a PR to fix this once I'm confident enough in one of these solutions (or find a better one).

			if (_config.skipEmptyLines)
			{
				// Current implementation: splice inside a loop shifts the
				// remaining rows on every removal, so it degrades to O(n^2)
				// when many rows are empty.
				for (var i = 0; i < _results.data.length; i++)
					if (testEmptyLine(_results.data[i]))
						_results.data.splice(i--, 1);

				// FIX 1: single Array.filter pass, O(n)
				// _results.data = _results.data.filter(function(d) {
				// 	return !testEmptyLine(d);
				// });

				// FIX 2: push non-empty rows into a new array, O(n)
				// var nonEmptyRows = [];
				// for (var i = 0; i < _results.data.length; i++) {
				// 	var row = _results.data[i];
				// 	if (!testEmptyLine(row)) {
				// 		nonEmptyRows.push(row);
				// 	}
				// }
				// _results.data = nonEmptyRows;

				// FIX 3: write into a pre-allocated array, then trim, O(n)
				// var nonEmptyRows = new Array(_results.data.length);
				// var j = 0;
				// for (var i = 0; i < _results.data.length; i++) {
				// 	if (!testEmptyLine(_results.data[i])) {
				// 		nonEmptyRows[j++] = _results.data[i];
				// 	}
				// }
				// nonEmptyRows.length = j;
				// _results.data = nonEmptyRows;
			}

P.S. Unfortunately, the codebase is not written against the latest ES standards, so I ruled out some solutions because this library should work properly in different environments. E.g. I'm not sure whether Array.filter is acceptable here.

@pokoli
Collaborator

pokoli commented Jan 12, 2022

What are the requirements of Array.filter? I think it would be the best solution, but we need to keep compatibility with plain JavaScript.

@bezrodnov
Contributor

BTW, we use PapaParse in our project and recently hit exactly the same issue in production.
I can confirm that the code gets stuck in this "for-with-array-splice" loop: in our case the number of elements in the array slowly decreased from 1+ million to ~200 records, which took more than 10 minutes. And I guess MS Excel could be what made the file look that weird.

@loia5tqd001
Author

That sounds great, thanks for your contribution @bezrodnov. I guess we just need to wait for the maintainers' judgment to conclude this.

@pokoli
Collaborator

pokoli commented Jan 12, 2022

@bezrodnov I will be happy to merge a PR that fixes this. As already commented, I think Array.filter is the way to go if it does not break anything. Otherwise, we can consider any other solution.

@bezrodnov
Contributor

@pokoli I've found that Array.filter is already used once in the codebase, so I went with that solution; I personally like how it reads compared to the other fixes mentioned above 😀.
If someone reports an environment where the Array.filter function is not defined, we will have to change this solution and the other usage of that function, which should be pretty straightforward anyway (either using array.push or adding a polyfill).
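If that ever becomes necessary, a minimal polyfill would look roughly like this (illustration only, not part of the PR; MDN publishes a spec-compliant version):

	// Minimal Array.prototype.filter polyfill sketch for pre-ES5 environments.
	if (!Array.prototype.filter) {
		Array.prototype.filter = function(fn, thisArg) {
			var out = [];
			for (var i = 0; i < this.length; i++)
				if (i in this && fn.call(thisArg, this[i], i, this))
					out.push(this[i]);
			return out;
		};
	}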

P.S. I had some fun comparing the solutions: https://jsbench.me/r7kybtdq5b/1
Surprisingly, the one that pushes into a new array was a bit faster than the others 😁

And just out of curiosity, I tried populating an array with 1 million random numbers using different methods: https://jsbench.me/8ukybr9f3i/1. The fastest was assigning by index into an array with a pre-allocated size, which did not surprise me, but I just don't personally like how it looks 🙂
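For context, the core patterns compared there are roughly these two, among others (a sketch of the benchmark setup, not the exact jsbench code):

	// Pattern A: push into an empty array.
	var a = [];
	for (var i = 0; i < 1000000; i++)
		a.push(Math.random());

	// Pattern B: assign by index into a pre-allocated array
	// (the fastest variant in the linked benchmark).
	var b = new Array(1000000);
	for (var j = 0; j < 1000000; j++)
		b[j] = Math.random();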

pokoli pushed a commit that referenced this issue Mar 15, 2022
Co-authored-by: Aleks Bezrodnov <aleksei.bezrodnov@zalando.de>
@pokoli
Collaborator

pokoli commented Mar 15, 2022

Fixed with #912

pokoli closed this as completed on Mar 15, 2022