Duplicates and multiple versions of samples

Dear authors,
while processing the MMC4 dataset, we found some anomalies and we hope you can comment on or explain these.

### Our Expectations

- There is one full large dataset (`mmc4`) that includes samples with face detections and there are several subsets of that large dataset that have been filtered:
    - One subset that contains only the samples without face detections (`mmc4-ff`) (public)
    - One subset that contains only the "core", i.e. samples that passed strict filtering (`mmc4-core`)
    - One subset that contains only the intersection of the two subsets above (`mmc4-core-ff`) (public)
- We assume that those are true subsets, e.g. every sample in `mmc4-core-ff` would also be contained in `mmc4-ff` etc.
- We assume that within each of the subsets, every sample is unique
	- That is, each web page on the internet results in at most one sample
	- Of course, different web pages under the same domain could result in multiple samples

### Our Findings
We found that:
- each of the subsets seems to contain many exact duplicate samples, at a rate of up to 1-2% of all samples
- some samples occur multiple times in different subsets in slightly changed form, for example with more images or with different similarity measures
- some subsets don't seem to be true subsets: they contain samples that are either missing from the corresponding larger set or present there only as a variant

### Exact Duplicates
First, we matched samples by the MD5 hash of the JSON string to find exact duplicates.

For example, for `mmc4-core-ff` we found 5598117 total samples (i.e. JSON lines) across all shards, but only 5506430 unique samples.
This means that 1.6% of the samples in that subset are exact duplicates.
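
The counting step can be sketched roughly as follows. This is a minimal illustration, not the exact script we ran: the function name `count_exact_duplicates` and the toy input lines are made up for the example, and it assumes each sample is one raw JSON line that is hashed as-is.

```python
import hashlib

def count_exact_duplicates(lines):
    """Return (total, unique) for an iterable of raw JSON lines.

    Two lines count as the same sample only if their MD5 digests match,
    i.e. if the JSON strings are byte-for-byte identical.
    """
    seen = set()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        total += 1
        seen.add(hashlib.md5(line.encode("utf-8")).hexdigest())
    return total, len(seen)

# Toy input (hypothetical data, not taken from MMC4).
example_lines = ['{"url": "a"}', '{"url": "b"}', '{"url": "a"}']
total, unique = count_exact_duplicates(example_lines)
duplicate_rate = (total - unique) / total
```

With the counts reported above, the same formula gives (5598117 - 5506430) / 5598117 ≈ 0.0164, i.e. about 1.6%.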

### Other Duplicates
If we match just by the document URL string, the duplicate rate is higher. For `mmc4-core-ff`, we then obtain only 5492699 unique samples, so 1.9% are duplicates.
Interestingly, some URLs appear not just twice but up to 88 times.

Here are the top ten duplicate URLs with the number of appearances:
```
('https://site.clubrunner.ca/page/clubrunner-mobile-app-now-available', 88),
('https://www.amazon.com.au/All-New-Kindle-With-Front-Light-Black/dp/B07FQ4DJ83', 59),
('https://www.plentygram.com/blog/how-to-make-your-instagram-account-famous/', 46),
('http://www.fuelly.com/', 41),
('https://www.bhhsnv.com/', 39),
('https://www.kikocosmetics.com/en-us/', 34),
('http://www.manchesteruniversitypress.co.uk/articles/freedom-and-the-fifth-commandment-qa-with-brian-heffernan/', 31),
('http://www.manchesteruniversitypress.co.uk/articles/mup-advent-calendar-starts-thursday/', 31),
('https://emeraldcoastbyowner.com/', 29),
('https://www.ait.com/web-development/?typhon', 29)
```
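
A list like the one above can be produced along these lines. This is only a sketch: the helper name `count_urls` and the toy lines are hypothetical, and it assumes the document URL is stored under a top-level `url` key in each JSON line.

```python
import json
from collections import Counter

def count_urls(lines, url_key="url"):
    """Count how often each document URL occurs in an iterable of JSON lines.

    `url_key` is an assumption about where the URL is stored in a sample.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)[url_key]] += 1
    return counts

# Toy input (hypothetical data, not taken from MMC4).
sample_lines = [
    '{"url": "https://example.com/a", "n_images": 1}',
    '{"url": "https://example.com/a", "n_images": 2}',
    '{"url": "https://example.com/b", "n_images": 1}',
    '{"url": "https://example.com/a", "n_images": 1}',
]
url_counts = count_urls(sample_lines)
top_ten = url_counts.most_common(10)  # (url, count) pairs, most frequent first
repeated = [(url, n) for url, n in url_counts.most_common() if n > 1]
```

Unlike the MD5 comparison, this also groups samples that share a URL but differ in content, which is why it reports fewer unique samples.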

We took a closer look at the URL that appears 88 times and found that 87 of those samples are exact duplicates, but 1 is slightly different.
For that 1 sample, the image similarities and the similarity matrix differ, although the text and images match those of the other 87 samples.
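
To see where two samples with the same URL diverge, one can compare them field by field. The sketch below is illustrative only: `differing_fields` is a hypothetical helper, and the field names in the toy samples (`text_list`, `similarity_matrix`) are assumptions about the sample layout.

```python
_MISSING = object()  # sentinel so that "key absent" differs from "value is None"

def differing_fields(sample_a, sample_b):
    """Return the sorted top-level keys whose values differ between two samples."""
    keys = set(sample_a) | set(sample_b)
    return sorted(
        key for key in keys
        if sample_a.get(key, _MISSING) != sample_b.get(key, _MISSING)
    )

# Toy samples (hypothetical data): same URL and text, different similarities.
sample_a = {"url": "https://example.com/a", "text_list": ["hello"], "similarity_matrix": [[0.31]]}
sample_b = {"url": "https://example.com/a", "text_list": ["hello"], "similarity_matrix": [[0.28]]}
changed = differing_fields(sample_a, sample_b)
```

An empty result means the two samples are identical at the top level, which should agree with the MD5 comparison up to key order and whitespace.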

### Faces vs. No Faces
We assumed that the fewer-faces dataset is simply a filtered version of the set with faces.
We filtered the set with faces ourselves, keeping only the samples that have `face_detections: None`.
However, this does not result in the same set as the published fewer-faces set.
This effect is related to the similar but slightly different samples mentioned above.
One example:
Compare `mmc4_core_faces/docs_shard_4943_v3.jsonl.zip` sample 113 with `mmc4_full_faces/docs_shard_4943_v2.jsonl.zip` sample 1523.
Both have the same URL, and the core set should be a subset of the full set. However, the second sample contains an additional image with face detections, while all its other images have none.

![image](https://github.com/allenai/mmc4/assets/126014612/813ecd7f-aa1e-444f-9651-3fd4cc76f716)
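
The filter-and-compare step described in this section can be sketched as follows. This is a simplified illustration on toy data, not our actual script: the helper names are hypothetical, and it assumes that `face_detections` is stored per image inside a list under an `image_info` key and that the URL is stored under `url`.

```python
def has_no_face_detections(sample, images_key="image_info", faces_key="face_detections"):
    """True if none of the sample's images carries face detections.

    `images_key` and `faces_key` are assumptions about the sample layout.
    """
    return all(image.get(faces_key) is None for image in sample.get(images_key, []))

def compare_url_sets(filtered, published, url_key="url"):
    """Return (URLs only in `filtered`, URLs only in `published`)."""
    filtered_urls = {sample[url_key] for sample in filtered}
    published_urls = {sample[url_key] for sample in published}
    return filtered_urls - published_urls, published_urls - filtered_urls

# Toy data (hypothetical): sample "b" has one extra image with a face
# in the faces set, but appears without that image in the published set.
with_faces = [
    {"url": "https://example.com/a", "image_info": [{"face_detections": None}]},
    {"url": "https://example.com/b", "image_info": [
        {"face_detections": None},
        {"face_detections": [[0, 0, 10, 10]]},
    ]},
]
published_ff = [
    {"url": "https://example.com/a", "image_info": [{"face_detections": None}]},
    {"url": "https://example.com/b", "image_info": [{"face_detections": None}]},
]

own_ff = [sample for sample in with_faces if has_no_face_detections(sample)]
only_in_ours, only_in_published = compare_url_sets(own_ff, published_ff)
```

If the published fewer-faces set were a plain filter of the set with faces, both returned sets would be empty. In the toy data, `only_in_published` is non-empty for the same reason as in the example above.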


### Questions
- How were the 4 sets constructed by the authors?
- Are our assumptions/expectations correct?
- If there are multiple different versions of a sample (e.g. one with more images), which one is the correct one?


