Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faker.datatype.number often gives duplicates #2355

Closed
8 of 10 tasks
frankandrobot opened this issue Aug 28, 2023 · 19 comments · Fixed by #2357
Closed
8 of 10 tasks

faker.datatype.number often gives duplicates #2355

frankandrobot opened this issue Aug 28, 2023 · 19 comments · Fixed by #2357
Assignees
Labels
c: bug Something isn't working has workaround Workaround provided or linked m: helpers Something is referring to the helpers module m: number Something is referring to the number module p: 1-normal Nothing urgent
Milestone

Comments

@frankandrobot
Copy link

Pre-Checks

Describe the bug

  • In unit tests running on our CI environment (and occasionally on our macs), the following code has been observed to throw a "duplicate error"---where it cannot find a unique integer
  • fakeId is called 50--100 times in total on the offending test runs.
  • the rate of failure on CI is around 1 in 1000 runs
  • haven't done the math but it is still very surprising because we are generating up the "max safe integer"... which is equal to 2^53 – 1... a very, very large number, so I don't expect 100 generations to render duplicates
export const fakeId = () =>
  faker.unique(() => faker.datatype.number({ max: Number.MAX_SAFE_INTEGER }), undefined, { maxTime: 100 }).toString();

Analysis

  • The mersenne implementation is very sus---why use this at all? For the most part, fakerjs seems to be a test helper, so IMO, this is over engineering. We can likely just use Math.random under the hood
  • Additionally, the implementation of unique is also sus on CI environments. Namely, this block:
    https://github.com/faker-js/faker/blob/next/src/modules/helpers/unique.ts#L114-L122
  • Most likely what's happening is the following---in our CI environments, we have a known issue with slowness. The mersenne implementation will unnecessarily return duplicates. And when both happen, now - startTime will be greater than maxTime, resulting in the thrown error

Suggestion:

  • A workaround we can do on our end is pass a very large maxTime.
  • but unique should rely on maxRetries only
  • faker should have an option to opt-out of using mersenne

Minimal reproduction code

  • see above

Additional Context

No response

Environment Info

OS: linux/macosx
Node: 16
faker: 7.3.0

Which module system do you use?

  • CJS
  • ESM

Used Package Manager

npm

@frankandrobot frankandrobot added c: bug Something isn't working s: pending triage Pending Triage labels Aug 28, 2023
@ST-DDT
Copy link
Member

ST-DDT commented Aug 28, 2023

I used the following code:

import { faker } from "@faker-js/faker";

const set = new Set();
const limit = 10_000_000;

for (let i = 0; i < limit; i++) {
  set.add(faker.number.int());
  //if (i % 100_000 === 0) {
  //  console.log(i);
  //}
}


console.log("=====================================");
console.log("Runs:  ", limit);
console.log("Values:", set.size);
console.log("Ratio: ", set.size / limit);

Which returned the following values:

Runs: 10000000
Values: 9988326
Ratio: 0.9988326

Which is 12 duplicates per 10k generated values.
This should not significantly slow down the generation process (unless you happen to run into a GC pause).


I used your code above to count the number of errors:

console.warn = () => {};

import { faker } from "@faker-js/faker";

const limit = 5_000_000;
let counter = 0;

for (let i = 0; i < limit; i++) {
  try {
    faker.helpers.unique(
      () => faker.number.int({ max: Number.MAX_SAFE_INTEGER }),
      undefined,
      { maxTime: 100 }
    );
  } catch (e) {
    counter++;
  }
  //if (i % 100_000 === 0) {
  //  console.log(i);
  //}
}

console.log("=====================================");
console.log("Runs:  ", limit);
console.log("Errors:", counter);
console.log("Ratio: ", counter / limit);

And I didn't get any in 5M executions, I didn't run it on a slim CI machine though.

So I assume it is caused by a slow CI pipeline host/GC pause.


  • A workaround we can do on our end is pass a very large maxTime.

This is the only workaround that I can currently think of except from implementing unique yourself.
I'm not sure what you consider large but I assume it is dependent on the number of parallel processes/tests running on the pipeline host and it's computing power.
We also have tests for the unique method implementation and while we do see some rare failures, these specifically seem to occur on the GitHub MacOs runners. So these are either especially slow or have a NodeJs bug/issue that wastes time.
Please make sure you don't reset your seed to a static value causing the same values to be generated over and over again.

Have you tried using Faker v8?

  • but unique should rely on maxRetries only

We aren't happy with the current implementation either.
We plan to remove/refactor the method (probably in v9).

The mersenne implementation is very sus---why use this at all? For the most part, fakerjs seems to be a test helper, so IMO, this is over engineering. We can likely just use Math.random under the hood

  • faker should have an option to opt-out of using mersenne

Faker uses mersenne to generate reproducible values,
reproducible values are important for certain test cases and test debugging.
We have an issue + PR to add support for custom generators:

@ST-DDT ST-DDT added has workaround Workaround provided or linked m: helpers Something is referring to the helpers module labels Aug 28, 2023
@xDivisionByZerox xDivisionByZerox removed the s: pending triage Pending Triage label Aug 28, 2023
@Shinigami92
Copy link
Member

Shinigami92 commented Aug 29, 2023

Please also read #1785 for more details and potentially you can already switch now to one of the new unique implementations out there, or even just copy-paste my suggestion and alter it to your own needs.
The "good" thing is that unique is and never was coupled to Faker itself, because it does not depend on the mersenne/seed.

@matthewmayer
Copy link
Contributor

matthewmayer commented Aug 29, 2023

It does seem like faker.number.int returns more duplicates than expected for large max. If every int was equally likely to be selected, you would expect fewer duplicates as the max increased. However in fact for max above 10^10 there doesn't seem to be an increase in the runs needed before the first repeated value, it occurs after approx 80,000 values on average. Note than Number.MAX_SAFE_INTEGER is of the magnitude 10^15.

const {
  faker
} = require("@faker-js/faker");
const attempts = 100
for (let pow = 2; pow < 15; pow++) {
  const max = 10 ** pow;
  let total = 0
  for (let attempt = 0; attempt < attempts; attempt++) {
    total += timeToFirstDuplicate(max)
  }
  let average = Math.round(total / attempts)
  console.log("First duplicate for faker.number.int({max:10^" + pow + "}) occurs after avg " + average)
}

function timeToFirstDuplicate(max) {
  const set = new Set();
  let duplicates = false
  let count = 0

  do {
    const j = faker.number.int({
      max
    });
    if (set.has(j)) {
      duplicates = true
    } else {
      set.add(j)
      count++
    }
  } while (!duplicates)
  return count
}
First duplicate for faker.number.int({max:10^2}) occurs after avg 12
First duplicate for faker.number.int({max:10^3}) occurs after avg 38
First duplicate for faker.number.int({max:10^4}) occurs after avg 114
First duplicate for faker.number.int({max:10^5}) occurs after avg 394
First duplicate for faker.number.int({max:10^6}) occurs after avg 1250
First duplicate for faker.number.int({max:10^7}) occurs after avg 4205
First duplicate for faker.number.int({max:10^8}) occurs after avg 12820
First duplicate for faker.number.int({max:10^9}) occurs after avg 37629
First duplicate for faker.number.int({max:10^10}) occurs after avg 81947
First duplicate for faker.number.int({max:10^11}) occurs after avg 78749
First duplicate for faker.number.int({max:10^12}) occurs after avg 85969
First duplicate for faker.number.int({max:10^13}) occurs after avg 83313
First duplicate for faker.number.int({max:10^14}) occurs after avg 79924

Perhaps the way we have mersenne set up means the number of possible outputs is less than expected?

@matthewmayer
Copy link
Contributor

This makes sense, as mersenne twister implementation returns around 2^32 seperate values which is approx 10^10, several orders of magnitude less than Number.MAX_SAFE_INTEGER

@matthewmayer
Copy link
Contributor

matthewmayer commented Aug 29, 2023

If you uniformly draw a random number uniformly from [1…N] with replacement the expected number of draws before you get a repeat is approx sqrt(pi/2*N) so for a mersenne twister with 2^32 values we should expect a repeat approximately every sqrt(pi/2*2^32) = 82,137 trials.

Apologies for the mathematical diversion :D

@Shinigami92
Copy link
Member

Send from my phone: what about multiplying faker.number.int() with faker.number.int(min: 1, max: 25000) 👀
This would increase the range but still is deterministic

// (2**32/2)/83000≈25000

We could move this multiplication also into our source to adjust it internally 🤔

But first let's hear the ideas from the others

@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

A little loss in precision is to be expected in algorithmic randomness.
IMO we should check whether there are certain values that are causing the duplicates or whether the duplicates are randomly distributed.

Also we could check if we could use a different access method from our twister. AFAICT there is one with higher precision or at least is says so.

@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

I tried switching from genrandReal2 to genrandRes53.

And the number of duplicates went from ~ $1 \over 10000$ to 0 (at 8kk runs).

Test-Code
import MersenneTwister19937 from "./twister";

const twister = new MersenneTwister19937();
twister.initGenrand(Math.random() * Number.MAX_SAFE_INTEGER);

const limit = 8_300_000; // JS object key limit!?
const data: Record<number, number> = {};

for (let i = 0; i < limit; i++) {
  const random = twister.genrandReal2();
  // const random = twister.genrandRes53();
  data[random] = (data[random] || 0) + 1;

  if (i % 100_000 === 0) {
    console.log("Runs:  ", i);
  }
}
console.log("Runs:  ", limit);
console.log("=====================================");
console.log("Calculating...");

const duplicateCount = Object.values(data).reduce((a, b) => a + b - 1, 0);
const duplicateMax = Object.values(data).reduce((a, b) => Math.max(a, b), 0) - 1;

console.log("=====================================");
console.log("Runs:      ", limit);
console.log("Duplicates:", duplicateCount);
console.log("Ratio:     ", duplicateCount / limit);
console.log("Max:       ", duplicateMax);
console.log("=====================================");

@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

I'll have a look at the bit magic in the implementation, when i have more time.

@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

I ran 500kkk invocations for faker.number.float() and it returns [0,1) wheres all the other methods return [0,max].
I also checked MersenneTwister19937.genrandInt32() and it returns [0,2^32) instead of [0,2^32] as described (checked with 225kkk invocations).

It looks like #1675 might have contributed to this.

@ST-DDT ST-DDT added the p: 2-high Fix main branch label Aug 29, 2023
@ST-DDT ST-DDT self-assigned this Aug 29, 2023
@ST-DDT ST-DDT added the s: needs decision Needs team/maintainer decision label Aug 29, 2023
@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

Is [0,1) or [0,1] supposed to be the correct behavior for number.float()?

IMO [0,1] is the correct version.

@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

faker.int.number() only returns even values when using very large max values.
This is caused by the float() being truncated after 17 places + loosing a few due to math.
Min ~ 0b00000000000000000000000000001000000000000000000000000
Max ~ 0b11111111111111111111111111111111000000000000000000000

@xDivisionByZerox
Copy link
Member

Is [0,1) or [0,1] supposed to be the correct behavior for number.float()?

IMO [0,1] is the correct version.

I think [0, 1) is the correct one. Math.random() also returns values in that range.

@ST-DDT ST-DDT added the m: number Something is referring to the number module label Aug 29, 2023
@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

I created #2357 to start working on a fix.

@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

Is [0,1) or [0,1] supposed to be the correct behavior for number.float()?
IMO [0,1] is the correct version.

I think [0, 1) is the correct one. Math.random() also returns values in that range.

All our other methods return [min,max] though and AFAICT faker.number.float({precision}) does that as well (because it uses int() internally).

@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

With the new PR we could introduce a new option maxExclusive: boolean = false or exclusiveMax though.

@xDivisionByZerox
Copy link
Member

With the new PR we could introduce a new option maxExclusive: boolean = false or exclusiveMax though.

I don't like this. Just define what the function does and be done. String.prototype.substring doesn't have a option includeLastCharacter either.

@ST-DDT
Copy link
Member

ST-DDT commented Aug 29, 2023

If someone needs that they should open a new feature request then.

@ST-DDT ST-DDT removed the s: needs decision Needs team/maintainer decision label Aug 31, 2023
@ST-DDT ST-DDT added p: 1-normal Nothing urgent and removed p: 2-high Fix main branch labels Aug 31, 2023
@ST-DDT
Copy link
Member

ST-DDT commented Aug 31, 2023

Team Decision

We will change mersenne to 53 bit precision to reduce the duplicates and create a separate PR for the float issues.

@ST-DDT ST-DDT modified the milestones: v8.x, v9.x, v9.0 Oct 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: bug Something isn't working has workaround Workaround provided or linked m: helpers Something is referring to the helpers module m: number Something is referring to the number module p: 1-normal Nothing urgent
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants