`momiji_generator` is a Python tool designed to process and generate image-text interleaved data. It can fetch HTML from URLs to create interleaved data, including `text` and `text_list` fields, and is also designed to work with the MOMIJI dataset by processing entries and generating these fields from source URLs.
- Downloads HTML content from specified URLs to generate image-text interleaved data, including `text` and `text_list` fields (see the record sketch after this list).
- Processes entries from the MOMIJI dataset, assuming `text` and `text_list` fields will be populated from source URLs.
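This README does not pin down the exact record schema, so the sketch below is only an illustration of what an interleaved record with `text` and `text_list` might look like; the `url` key and the sample values are hypothetical.

```python
import json

# Illustrative only: one possible JSONL record, shown as a Python dict.
# Only `text` and `text_list` are named in this README; `url` and the
# sample values are hypothetical.
example_record = {
    "url": "https://example.com/article",
    "text": "First paragraph.\n\nSecond paragraph.",
    "text_list": [
        "First paragraph.",
        "Second paragraph.",
    ],
}

print(json.dumps(example_record, ensure_ascii=False, indent=2))
```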
- Python (specify version if known, e.g., Python 3.8+)
- (Potentially other dependencies, such as `requests` for fetching URLs, `beautifulsoup4` for parsing HTML, etc. - please specify if known)
- `uv` (recommended for running scripts, as shown in the usage examples)
To set up the project and install dependencies, use `uv`:

```bash
uv sync
```
This command will create a virtual environment (if one doesn't exist) and install all the necessary packages specified in your project's configuration (e.g., `pyproject.toml` or `requirements.txt`, if `uv` is configured to use them).
Make sure you have `uv` installed. If not, you can find installation instructions here.
The primary way to use `momiji_generator` is via its command-line interface.

```bash
uv run python retrieve_text_from_jsonl.py --input_path 1707947473360.9.jsonl.gz --output_path 1707947473360.9.jsonl --limit 20
```
This command processes a gzipped JSONL file (presumably from the MOMIJI dataset), extracts the relevant text fields, and saves the output to `1707947473360.9.jsonl`, limiting processing to the first 20 entries.
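To spot-check the result, a few lines of Python are enough to inspect the generated fields. This is a minimal sketch, not part of `momiji_generator` itself, and it assumes each output line is a JSON object containing `text` and `text_list`.

```python
import json

# Minimal sketch: inspect the first few records of the output JSONL.
# Assumes each line is a JSON object with `text` and `text_list` fields.
with open("1707947473360.9.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        record = json.loads(line)
        print(f"--- record {i} ---")
        print("text length:", len(record.get("text", "")))
        print("text_list segments:", len(record.get("text_list", [])))
```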
If you prefer to use Docker, you can build an image and run the tool in a container.
- Build the Docker image:

  ```bash
  docker build -t momiji_generator .
  ```
- Run the container:

  This example mounts the current directory (`$(pwd)`) to `/data_io` inside the container, allowing the script to read input from and write output to your host machine.

  ```bash
  docker run --rm \
    -v "$(pwd):/data_io" \
    momiji_generator \
    --input_path /data_io/1707947473360.9.jsonl.gz \
    --output_path /data_io/new_output.jsonl \
    --limit 5
  ```
This command will process `1707947473360.9.jsonl.gz` from your current directory and save the output as `new_output.jsonl` in the same directory, processing up to 5 entries.
```bash
# Add command example for URL processing here if applicable
# e.g., uv run python generate_from_url.py --url https://example.com/article --output_path output.jsonl
```
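Since the URL-processing entry point is not yet documented, the following is only a conceptual sketch of how HTML fetched from a URL could be turned into `text` and `text_list` fields. It assumes `requests` and `beautifulsoup4` are installed (both are only listed above as potential dependencies); the function and output layout are hypothetical, not `momiji_generator`'s actual implementation.

```python
import json

import requests
from bs4 import BeautifulSoup


def build_record(url: str) -> dict:
    """Conceptual sketch: fetch a page and derive `text` / `text_list`."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Treat each non-empty paragraph as one segment of the interleaved document.
    text_list = [
        p.get_text(strip=True)
        for p in soup.find_all("p")
        if p.get_text(strip=True)
    ]
    return {"url": url, "text": "\n\n".join(text_list), "text_list": text_list}


if __name__ == "__main__":
    record = build_record("https://example.com/article")
    with open("output.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```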
This tool is designed to work with the MOMIJI (Modern Open Multimodal Japanese filtered Dataset). MOMIJI is a large-scale public dataset of image-text interleaved web documents.
Important Note on MOMIJI Dataset License and Usage: The MOMIJI dataset is licensed under CC-BY-4.0. However, its use is strictly limited to "information analysis" as defined in Article 30-4 of the Japanese Copyright Act. This means it can be used for non-consumptive machine processing (like training AI models, research) but not for human enjoyment of the works. Please refer to the Hugging Face dataset card for detailed permitted and prohibited uses.
Contributions are welcome! 🎉 Please feel free to:
- Open an issue to report bugs or suggest features.
- Submit a pull request with improvements.