Running paragraph level deduplication on c4

I am trying to run paragraph-level deduplication with the dolma library and wanted to test it on c4. I downloaded `allenai/c4` from Hugging Face, updated the schema to `text (string, doc content), id (long, unique id), source ("c4")`, and saved it as `json.gz` files of `~250MB` each. Every time I run `dolma -c c4-dedupe.yaml dedupe`, the output attribute is an empty list. Here is the `yaml` I am using (which is almost identical to the one provided at `configs/dolma-v1_5/para_dedupe/c4.yaml`):

```
documents:
  - /home/c4/v0/documents/*.gz

dedupe:
  name: dedupe_paragraphs
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans
  skip_empty: true

bloom_filter:
  file: /tmp/c4.bloom
  read_only: false
  estimated_doc_count: 30000000000
  desired_false_positive_rate: 1e-06

processes: 350
```
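For scale, here is a rough estimate of what `estimated_doc_count` and `desired_false_positive_rate` imply for the size of the filter. This uses the textbook Bloom filter sizing formula, which is an assumption on my part; dolma may size its filter differently, and the function names are only illustrative.

```python
import math


def bloom_filter_bits(n_items: int, fp_rate: float) -> int:
    """Textbook optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_hash_count(n_items: int, n_bits: int) -> int:
    """Textbook optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(n_bits / n_items * math.log(2)))


# Values taken from the bloom_filter section of the config above.
ESTIMATED_COUNT = 30_000_000_000
FALSE_POSITIVE_RATE = 1e-06

bits = bloom_filter_bits(ESTIMATED_COUNT, FALSE_POSITIVE_RATE)
hashes = optimal_hash_count(ESTIMATED_COUNT, bits)
size_gb = bits / 8 / 1e9

print(f"approx. filter size: {size_gb:.1f} GB, hash functions: {hashes}")
```

Under that formula the filter comes out on the order of 100 GB, so it may be worth confirming that the machine has enough free memory and that `/tmp` has enough space to hold the file.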

The machine I am using has `360 vCPUs` and runs `Debian 11` with `Python 3.10`. I tried both `pip install dolma` and installing the library directly from the repo; neither fixed the problem. I also built a small example input like the one in [this discussion](https://github.com/allenai/dolma/issues/96), which worked fine, so I am confused by this result.
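For reference, the schema conversion described above looks roughly like the following. This is a minimal sketch rather than the exact script; `to_records` and `write_shard` are hypothetical helper names, and the sample texts stand in for the real c4 rows.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator


def to_records(texts: Iterable[str], source: str = "c4", start_id: int = 0) -> Iterator[Dict]:
    """Wrap raw texts in the schema from the post: text, id (integer), source."""
    for offset, text in enumerate(texts):
        yield {"id": start_id + offset, "text": text, "source": source}


def write_shard(records: Iterable[Dict], path: str) -> int:
    """Write one JSON object per line to a gzip-compressed file; return the row count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


# Tiny demonstration with placeholder texts instead of real c4 rows.
sample_texts = ["first paragraph\nsecond paragraph", "first paragraph\nanother one"]
shard_path = os.path.join(tempfile.mkdtemp(), "c4-0000.json.gz")
written = write_shard(to_records(sample_texts), shard_path)

# Read the shard back to confirm that every line parses as a JSON object.
with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
    rows = [json.loads(line) for line in handle]

print(written, sorted(rows[0].keys()))
```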

I would really appreciate any help or thoughts on why this might be happening.
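To make "always an empty list" concrete, a check like the one below counts how many rows in an attribute file have at least one span. It is a sketch that assumes the attribute files are gzip-compressed JSON lines, each with an `attributes` object keyed by the attribute name from the config; `summarize_attributes` is a hypothetical helper, and the demo file is synthetic.

```python
import gzip
import json
import os
import tempfile
from typing import Tuple


def summarize_attributes(path: str, attribute_name: str) -> Tuple[int, int]:
    """Return (total rows, rows with at least one span) for one attribute file."""
    total = 0
    flagged = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            spans = row.get("attributes", {}).get(attribute_name, [])
            total += 1
            if spans:
                flagged += 1
    return total, flagged


# Synthetic attribute file: one row with a span, one row with an empty list.
name = "bff_duplicate_paragraph_spans"
demo_path = os.path.join(tempfile.mkdtemp(), "attributes.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"id": 0, "attributes": {name: [[0, 15, 1.0]]}}) + "\n")
    handle.write(json.dumps({"id": 1, "attributes": {name: []}}) + "\n")

total_rows, flagged_rows = summarize_attributes(demo_path, name)
print(f"{flagged_rows} of {total_rows} rows have duplicate spans")
```

On my real output, every row falls in the empty-list case.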
