luvdatacomp/prompt.txt

# DataComp Cluster Quality Classifier for CLIP Model Training
Classify image-text clusters as either "Acceptable" or "Unacceptable" for training high-quality CLIP (Contrastive Language-Image Pre-training) models. These models learn visual concepts from natural language supervision. Focus on clusters that contain diverse, informative, and high-quality samples that would benefit CLIP's ability to understand and connect images with text.

## Key Considerations
1. Diversity: Prioritize clusters with a wide range of subjects, scales, and perspectives.
2. Informative Content: Favor clusters with educational value and rich descriptive text.
3. Quality: Look for clusters likely to contain high-quality imagery based on descriptions.
4. Avoid Repetition: Be cautious of clusters with near-duplicate or highly similar entries.
5. Ethical Concerns: Be wary of clusters potentially containing explicit content or harmful stereotypes.
6. Long-tail Knowledge: Consider the value of specialized knowledge, but be cautious of overly niche content.

## Examples
### Example 1
Cluster data:
```json
{{
  "closest_samples": [
    "Detailed landscape photography of mountains",
    "Close-up of a rare flower species",
    "Historic architecture in European cities",
    "Portrait of a person from an indigenous culture",
    "Microscopic image of a unique cellular structure"
  ],
  "furthest_samples": [
    "Abstract digital art",
    "Underwater photography of coral reefs",
    "Aerial view of urban sprawl",
    "Close-up of a mechanical watch movement",
    "Traditional cuisine from different cultures"
  ]
}}
```
Reason: Highly diverse content covering various subjects, scales, and perspectives. Rich descriptive text suggests high-quality imagery. Educational value across multiple domains.
Classification: Acceptable

### Example 2
Cluster data:
```json
{{
  "closest_samples": [
    "IMG_001.jpg",
    "IMG_002.jpg",
    "IMG_003.jpg",
    "IMG_004.jpg",
    "IMG_005.jpg"
  ],
  "furthest_samples": [
    "IMG_100.jpg",
    "IMG_101.jpg",
    "IMG_102.jpg",
    "IMG_103.jpg",
    "IMG_104.jpg"
  ]
}}
```
Reason: Generic filenames without descriptive content. Likely low-quality or repetitive images. Provides no useful text-image pairs for CLIP training.
Classification: Unacceptable

### Example 3
Cluster data:
```json
{{
  "closest_samples": [
    "Quantum computing explained with visual diagrams",
    "Renewable energy technologies comparison infographic",
    "Ancient civilization archaeological discoveries photos",
    "Cutting-edge medical research findings illustrated",
    "Innovative sustainable architecture designs renderings"
  ],
  "furthest_samples": [
    "Renaissance art masterpieces high-resolution scans",
    "Modern dance performance action shots",
    "Culinary techniques from around the world step-by-step photos",
    "Space exploration milestones historic images",
    "Breakthrough in artificial intelligence conceptual illustrations"
  ]
}}
```
Reason: Extremely diverse and informative content covering various fields of science, technology, culture, and arts. Rich descriptive text suggests high-quality visuals. Excellent material for training CLIP on complex concepts across multiple domains.
Classification: Acceptable

## Cluster to Classify
Cluster data:
```json
{{
  "closest_samples": {closest_samples},
  "furthest_samples": {furthest_samples}
}}
```
Reason: 
Classification: