Realm MongoDB Export Utility v2.0

A two-phase MongoDB-to-CSV export system with intelligent field discovery, relationship expansion, and human-editable configuration.

🎉 What's New in v2.0

  • Automatic Relationship Discovery: Zero hardcoding - system tests ObjectIds against actual collections
  • Primary Mode for Arrays: Extract clean values from first array element (names, emails, phones)
  • Count Mode for Arrays: Get array lengths for all arrays
  • Hierarchical Audit Trees: Visual representation of all field expansions
  • Smart Field Detection: Uses patterns to identify useful fields generically
  • 100% Data-Driven: Works with any MongoDB database without schema-specific configuration

🚀 Quick Start

# Phase 1: Discover all fields and create configuration
./gradlew discover -Pcollection=listings

# Phase 2: Export data using the configuration
./gradlew configExport -Pcollection=listings

The configuration file (config/listings_fields.json) can be edited between phases to customize the export.

✨ Key Features

Two-Phase Workflow

  • Discovery Phase: Analyzes your MongoDB collection to discover all fields, relationships, and statistics
  • Export Phase: Uses the configuration to export exactly the fields you want

Intelligent Field Discovery

  • Automatically discovers all fields including nested documents and arrays
  • Expands foreign key relationships up to 3 levels deep
  • Collects statistics on field usage and distinct values
  • Filters out empty and single-value fields automatically
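
The relationship expansion described above can be illustrated with a small in-memory sketch. This is not the project's FieldDiscoveryService: the class name RelationshipSketch, its method names, and the map-based stand-in for MongoDB collections are all assumptions made for illustration. A candidate ID is tested against the IDs actually present in each collection, and a matching document is expanded recursively up to the stated depth of 3.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a database is modelled as collectionName -> (id -> document).
final class RelationshipSketch {
    static final int MAX_DEPTH = 3; // expansion depth stated in this README

    // Returns the name of the first collection that contains the given id, if any.
    static Optional<String> findReferencedCollection(
            Object id, Map<String, Map<String, Map<String, Object>>> db) {
        if (!(id instanceof String)) {
            return Optional.empty();
        }
        return db.entrySet().stream()
                .filter(e -> e.getValue().containsKey(id))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    // Flattens a document into dotted field paths, following references
    // while the current depth is below MAX_DEPTH.
    static Map<String, Object> expand(
            Map<String, Object> doc,
            Map<String, Map<String, Map<String, Object>>> db,
            int depth) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : doc.entrySet()) {
            Object value = field.getValue();
            Optional<String> target = findReferencedCollection(value, db);
            if (target.isPresent() && depth < MAX_DEPTH) {
                Map<String, Object> referenced = db.get(target.get()).get(value);
                expand(referenced, db, depth + 1)
                        .forEach((k, v) -> out.put(field.getKey() + "." + k, v));
            } else {
                out.put(field.getKey(), value);
            }
        }
        return out;
    }
}
```

Calling expand(listing, db, 0) on a listing whose agent field holds an ID found in the agents collection would produce a path such as agent.fullName. The depth limit also keeps circular references from recursing indefinitely.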

Human-Editable Configuration

  • JSON configuration can be manually edited between phases
  • Control which fields to include/exclude
  • Customize business names for columns
  • Configure array display (first value or comma-separated list)

Smart Array Handling

  • Primary Mode: Extracts clean values from first array element (names, emails, phones)
  • Count Mode: Provides array lengths for all arrays
  • Automatic Relationship Discovery: Zero hardcoding, tests ObjectIds against actual collections
  • Automatically detects the best field to extract from array objects
  • Sorts array values alphanumerically
  • Configurable display modes per field
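
As a rough illustration of the two array modes listed above, the following sketch shows one way primary and count values could be derived from an array field. The class ArrayModesSketch and its methods are hypothetical, and returning an empty string for missing data is an assumption rather than documented behaviour.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two array modes; not the project's actual code.
final class ArrayModesSketch {
    // Count mode: the array length, or 0 when the array is missing.
    static int count(List<?> array) {
        return array == null ? 0 : array.size();
    }

    // Primary mode: the value taken from the first array element.
    // When that element is an object, the named field is read from it.
    static String primary(List<?> array, String extractField) {
        if (array == null || array.isEmpty()) {
            return "";
        }
        Object first = array.get(0);
        if (first instanceof Map) {
            Object value = ((Map<?, ?>) first).get(extractField);
            return value == null ? "" : value.toString();
        }
        return first == null ? "" : first.toString();
    }
}
```

For an emails array of objects, primary(emails, "address") would yield the first address and count(emails) the number of entries, which is the kind of paired output the two modes describe.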

📋 Usage Examples

Basic Two-Phase Export

# Step 1: Discover fields
./gradlew discover -Pcollection=listings
# Creates: config/listings_fields.json (configuration)
#          config/listings_expansion_audit.txt (visual audit tree)

# Step 2: (Optional) Review and edit configuration
vi config/listings_fields.json
# Review the audit tree to verify expansions:
cat config/listings_expansion_audit.txt

# Step 3: Export data
./gradlew configExport -Pcollection=listings

# Optional: Export with row limit for testing
./gradlew configExport -Pcollection=listings -ProwLimit=1000

Working with Different Collections

# Transactions
./gradlew discover -Pcollection=transactions
./gradlew configExport -Pcollection=transactions

# Agents
./gradlew discover -Pcollection=agents
./gradlew configExport -Pcollection=agents

📄 Configuration File

The discovery phase creates a JSON configuration file with this structure:

{
  "collection": "listings",
  "discoveredAt": "2025-08-11T10:00:00Z",
  "discoveryParameters": {
    "sampleSize": 10000,
    "expansionDepth": 3,
    "minDistinctNonNullValues": 2
  },
  "fields": [
    {
      "fieldPath": "mlsNumber",
      "businessName": "MLS Number",
      "dataType": "string",
      "include": true,
      "statistics": {
        "distinctNonNullValues": 9875,
        "nullCount": 125
      }
    },
    {
      "fieldPath": "listingAgents",
      "businessName": "Listing Agents",
      "dataType": "array",
      "include": true,
      "arrayConfig": {
        "objectType": "objectId",
        "referenceCollection": "agents",
        "extractField": "fullName",
        "availableFields": ["createdAt", "fullName", "lastUpdated", "privateURL"],
        "displayMode": "comma_separated",
        "sortOrder": "alphanumeric"
      }
    }
  ],
  "requiredCollections": ["properties", "agents"],
  "exportSettings": {
    "batchSize": 5000,
    "useBusinessNames": true
  }
}

Customizing the Configuration

  • Exclude a field: Set "include": false
  • Change column name: Edit "businessName"
  • Array display: Change "displayMode" to "first" or "comma_separated"
  • Array field extraction: Change "extractField" to any value from "availableFields"
  • See available options: Check "availableFields" array to see all possible fields you can extract
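
To make the effect of "displayMode" concrete, here is a minimal sketch of how already-extracted array values might be rendered into a single CSV cell. The class DisplayModeSketch is hypothetical. It assumes that "first" returns the first element in its original order, that "comma_separated" sorts by natural string order to mirror "sortOrder": "alphanumeric", and that values are joined with ", "; none of these details are confirmed by the source.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: renders extracted array values according to displayMode.
final class DisplayModeSketch {
    static String render(List<String> values, String displayMode) {
        if (values.isEmpty()) {
            return "";
        }
        switch (displayMode) {
            case "first":
                // Assumption: "first" keeps the original array order.
                return values.get(0);
            case "comma_separated":
                // Assumption: natural string ordering stands in for "alphanumeric".
                return values.stream().sorted().collect(Collectors.joining(", "));
            default:
                throw new IllegalArgumentException("Unknown displayMode: " + displayMode);
        }
    }
}
```

With the values "Zoe Park" and "Ann Lee", the "first" mode would give "Zoe Park" and the "comma_separated" mode "Ann Lee, Zoe Park".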

🔧 Installation

Prerequisites

  • Java 11 or higher
  • MongoDB connection
  • 16GB+ RAM recommended

Setup

  1. Clone the repository
  2. Configure MongoDB connection in application.properties:
    mongodb.url.dev=mongodb://username:password@host:port/?authSource=admin
    current.environment=dev
    database.name=realm
  3. Build the project: ./gradlew build
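
The property names above suggest that the connection URL is selected by environment. A minimal sketch of that lookup, assuming the application resolves mongodb.url.<environment> from current.environment (the class ConnectionSettingsSketch and this resolution logic are assumptions, not confirmed by the source):

```java
import java.util.Properties;

// Hypothetical sketch of environment-based URL resolution.
final class ConnectionSettingsSketch {
    // Looks up mongodb.url.<env>, where <env> is the value of current.environment.
    static String resolveMongoUrl(Properties props) {
        String env = props.getProperty("current.environment");
        if (env == null || env.isBlank()) {
            throw new IllegalStateException("current.environment is not set");
        }
        String url = props.getProperty("mongodb.url." + env);
        if (url == null) {
            throw new IllegalStateException("Missing property mongodb.url." + env);
        }
        return url;
    }
}
```

Under this assumption, adding a second environment would mean adding another mongodb.url.<name> entry and switching current.environment to that name.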

📊 Field Filtering Rules

The discovery phase automatically applies these intelligent rules:

| Rule | Description | Example |
|------|-------------|---------|
| Include Business IDs | Always included if they have data | mlsNumber, listingId, transactionId |
| Exclude Technical IDs | Always excluded | _id, __v, other fields ending with Id |
| Exclude Empty Fields | 0 distinct non-null values | Fields that are always null |
| Exclude Single-Value | Only 1 distinct value | Fields like status="active" everywhere |
| Include Multi-Value | 2+ distinct values | Normal data fields |
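
The rules in this table can be read as an ordered decision, with the business-ID rule taking precedence over the technical-ID rule. The sketch below is one possible encoding. The class FieldFilterSketch is hypothetical, and the hardcoded BUSINESS_IDS set is only a stand-in, since the README does not say how the tool recognises business IDs.

```java
import java.util.Set;

// Hypothetical sketch of the filtering rules, applied in table order.
final class FieldFilterSketch {
    // Stand-in list taken from the table's examples (an assumption).
    static final Set<String> BUSINESS_IDS = Set.of("mlsNumber", "listingId", "transactionId");

    // Matches "minDistinctNonNullValues": 2 in the sample configuration.
    static final long MIN_DISTINCT = 2;

    static boolean include(String fieldName, long distinctNonNullValues) {
        if (BUSINESS_IDS.contains(fieldName)) {
            return distinctNonNullValues > 0; // included whenever they have data
        }
        if (fieldName.equals("_id") || fieldName.equals("__v") || fieldName.endsWith("Id")) {
            return false; // technical IDs are always excluded
        }
        // Drops empty (0 distinct values) and single-value (1 distinct value) fields.
        return distinctNonNullValues >= MIN_DISTINCT;
    }
}
```

Checking business IDs first is what lets listingId survive even though it ends with Id.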

✅ Testing Checklist

Basic Workflow Testing

# 1. Compile the project
./gradlew build

# 2. Run discovery
./gradlew discover -Pcollection=listings

# 3. Check configuration file
jq . config/listings_fields.json | head -50

# 4. Run export
./gradlew configExport -Pcollection=listings

# 5. Verify output
ls -lh output/*.csv

Configuration Editing Test

# 1. Edit configuration
vi config/listings_fields.json
# - Set some fields to "include": false
# - Change some businessName values
# - Modify array displayMode settings

# 2. Re-run export
./gradlew configExport -Pcollection=listings

# 3. Verify changes in output

Advanced Testing

  • Test with all collections (listings, transactions, agents)
  • Test with sparse collections
  • Verify relationship expansion
  • Check array field handling

🚨 Troubleshooting

Common Issues and Solutions

| Issue | Solution |
|-------|----------|
| Discovery fails | Check MongoDB URL in application.properties |
| No config file | Ensure discovery completed successfully |
| Missing fields | Check include flag in JSON configuration |
| Memory errors | Increase heap in build.gradle: -Xmx24g |
| Empty arrays | Verify extractField in array configuration |

Debug Commands

# Check logs
tail -f logs/application.log

# Verify MongoDB connection (on older installations, use the legacy mongo shell instead of mongosh)
mongosh "$MONGO_URL" --eval "db.listings.countDocuments()"

# Check configuration
jq '.fields[] | select(.include==false)' config/listings_fields.json

⚡ Performance

  • Discovery Phase: ~2-3 minutes for a 10,000-document sample
  • Export Phase: 3,500-5,000 documents/second (varies with expansion)
  • Memory Usage: 16-24GB heap recommended
  • Collection Caching: Auto-caches collections with fewer than 100K documents

📄 Output Files

CSV Export Format

  • Standard: RFC 4180 compliant CSV format
  • Quoting: Fields containing commas, quotes, or newlines are quoted
  • Escaping: Quotes within fields are escaped by doubling ("")
  • Line endings: CRLF (\r\n) as per RFC 4180
  • Encoding: UTF-8
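
The quoting and escaping rules listed above follow RFC 4180 and can be sketched in a few lines. The class CsvSketch and its methods are hypothetical helpers, not the exporter's actual code. Writing null as an empty field is an assumption, and quoting fields that contain a bare carriage return is a conservative addition.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of RFC 4180-style field escaping and row assembly.
final class CsvSketch {
    // Quotes a field only when it contains a comma, quote, or line break,
    // doubling any embedded quotes.
    static String escape(String field) {
        if (field == null) {
            return ""; // assumption: nulls are written as empty fields
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins escaped fields with commas and terminates the row with CRLF.
    static String row(List<String> fields) {
        return fields.stream().map(CsvSketch::escape).collect(Collectors.joining(",")) + "\r\n";
    }
}
```

A value such as Lee, Ann is therefore written as "Lee, Ann", so the embedded comma is not mistaken for a column separator.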

Export Summary File

Each export generates a {collection}_summary.json file containing:

  • Field-level statistics (null counts, unique values, sample data)
  • Field categorization (ALWAYS_EMPTY, SINGLE_VALUE, MEANINGFUL)
  • Value distributions for meaningful fields
  • Export metadata (processing time, document count)

πŸ“ Project Structure

src/main/java/com/example/mongoexport/
├── config/                        # Configuration classes
│   ├── FieldConfiguration.java    # Individual field metadata
│   └── DiscoveryConfiguration.java # Root configuration
├── discovery/
│   └── FieldDiscoveryService.java # Field discovery and audit logic
├── export/
│   └── ConfigurationBasedExporter.java # Config-based export
├── DiscoveryRunner.java          # Discovery entry point
└── ConfigExportRunner.java        # Export entry point

config/                            # Configuration files
├── {collection}_fields.json      # Editable field configuration
└── {collection}_expansion_audit.txt # Visual expansion tree

output/                            # Export results
└── {collection}_ultra_comprehensive_{timestamp}.csv

🔄 Version History

v2.0 (Current) - Two-Phase Workflow

  • Separated discovery and export phases
  • Human-editable JSON configuration
  • Enhanced array field handling with reference lookups
  • Improved collection caching
  • Shows available fields for user configuration

πŸ“ License

Private repository - Internal use only

🆘 Support

For issues or questions:

  1. Check the testing checklist above
  2. Review CLAUDE.md for detailed documentation
  3. Contact the development team
