A sophisticated two-phase MongoDB to CSV export system with intelligent field discovery, relationship expansion, and human-editable configuration.
- Automatic Relationship Discovery: Zero hardcoding - system tests ObjectIds against actual collections
- Primary Mode for Arrays: Extract clean values from first array element (names, emails, phones)
- Count Mode for Arrays: Get array lengths for all arrays
- Hierarchical Audit Trees: Visual representation of all field expansions
- Smart Field Detection: Uses patterns to identify useful fields generically
- 100% Data-Driven: Works with ANY MongoDB database without configuration
# Phase 1: Discover all fields and create configuration
./gradlew discover -Pcollection=listings
# Phase 2: Export data using the configuration
./gradlew configExport -Pcollection=listingsThe configuration file (config/listings_fields.json) can be edited between phases to customize the export.
- Discovery Phase: Analyzes your MongoDB collection to discover all fields, relationships, and statistics
- Export Phase: Uses the configuration to export exactly the fields you want
- Automatically discovers all fields including nested documents and arrays
- Expands foreign key relationships up to 3 levels deep
- Collects statistics on field usage and distinct values
- Filters out empty and single-value fields automatically
- JSON configuration can be manually edited between phases
- Control which fields to include/exclude
- Customize business names for columns
- Configure array display (first value or comma-separated list)
- Primary Mode: Extracts clean values from first array element (names, emails, phones)
- Count Mode: Provides array lengths for all arrays
- Automatic Relationship Discovery: Zero hardcoding, tests ObjectIds against actual collections
- Automatically detects the best field to extract from array objects
- Sorts array values alphanumerically
- Configurable display modes per field
# Step 1: Discover fields
./gradlew discover -Pcollection=listings
# Creates: config/listings_fields.json (configuration)
# config/listings_expansion_audit.txt (visual audit tree)
# Step 2: (Optional) Review and edit configuration
vi config/listings_fields.json
# Review the audit tree to verify expansions:
cat config/listings_expansion_audit.txt
# Step 3: Export data
./gradlew configExport -Pcollection=listings
# Optional: Export with row limit for testing
./gradlew configExport -Pcollection=listings -ProwLimit=1000# Transactions
./gradlew discover -Pcollection=transactions
./gradlew configExport -Pcollection=transactions
# Agents
./gradlew discover -Pcollection=agents
./gradlew configExport -Pcollection=agentsThe discovery phase creates a JSON configuration file with this structure:
{
"collection": "listings",
"discoveredAt": "2025-08-11T10:00:00Z",
"discoveryParameters": {
"sampleSize": 10000,
"expansionDepth": 3,
"minDistinctNonNullValues": 2
},
"fields": [
{
"fieldPath": "mlsNumber",
"businessName": "MLS Number",
"dataType": "string",
"include": true,
"statistics": {
"distinctNonNullValues": 9875,
"nullCount": 125
}
},
{
"fieldPath": "listingAgents",
"businessName": "Listing Agents",
"dataType": "array",
"include": true,
"arrayConfig": {
"objectType": "objectId",
"referenceCollection": "agents",
"extractField": "fullName",
"availableFields": ["createdAt", "fullName", "lastUpdated", "privateURL"],
"displayMode": "comma_separated",
"sortOrder": "alphanumeric"
}
}
],
"requiredCollections": ["properties", "agents"],
"exportSettings": {
"batchSize": 5000,
"useBusinessNames": true
}
}- Exclude a field: Set
"include": false - Change column name: Edit
"businessName" - Array display: Change
"displayMode"to"first"or"comma_separated" - Array field extraction: Change
"extractField"to any value from"availableFields" - See available options: Check
"availableFields"array to see all possible fields you can extract
- Java 11 or higher
- MongoDB connection
- 16GB+ RAM recommended
- Clone the repository
- Configure MongoDB connection in
application.properties:mongodb.url.dev=mongodb://username:password@host:port/?authSource=admin current.environment=dev database.name=realm
- Build the project:
./gradlew build
The discovery phase automatically applies these intelligent rules:
| Rule | Description | Example |
|---|---|---|
| Include Business IDs | Always included if they have data | mlsNumber, listingId, transactionId |
| Exclude Technical IDs | Always excluded | _id, __v, fields ending with Id |
| Exclude Empty Fields | 0 distinct non-null values | Fields that are always null |
| Exclude Single-Value | Only 1 distinct value | Fields like status="active" everywhere |
| Include Multi-Value | 2+ distinct values | Normal data fields |
# 1. Compile the project
./gradlew build
# 2. Run discovery
./gradlew discover -Pcollection=listings
# 3. Check configuration file
cat config/listings_fields.json | jq . | head -50
# 4. Run export
./gradlew configExport -Pcollection=listings
# 5. Verify output
ls -lh output/*.csv# 1. Edit configuration
vi config/listings_fields.json
# - Set some fields to "include": false
# - Change some businessName values
# - Modify array displayMode settings
# 2. Re-run export
./gradlew configExport -Pcollection=listings
# 3. Verify changes in output- Test with all collections (listings, transactions, agents)
- Test with sparse collections
- Verify relationship expansion
- Check array field handling
| Issue | Solution |
|---|---|
| Discovery fails | Check MongoDB URL in application.properties |
| No config file | Ensure discovery completed successfully |
| Missing fields | Check include flag in JSON configuration |
| Memory errors | Increase heap in build.gradle: -Xmx24g |
| Empty arrays | Verify extractField in array configuration |
# Check logs
tail -f logs/application.log
# Verify MongoDB connection
mongo $MONGO_URL --eval "db.listings.count()"
# Check configuration
jq '.fields[] | select(.include==false)' config/listings_fields.json- Discovery Phase: ~2-3 minutes for 10,000 document sample
- Export Phase: 3,500-5,000 documents/second (varies with expansion)
- Memory Usage: 16-24GB heap recommended
- Collection Caching: Auto-caches collections <100K documents
- Standard: RFC 4180 compliant CSV format
- Quoting: Fields containing commas, quotes, or newlines are quoted
- Escaping: Quotes within fields are escaped by doubling (
"") - Line endings: CRLF (
\r\n) as per RFC 4180 - Encoding: UTF-8
Each export generates a {collection}_summary.json file containing:
- Field-level statistics (null counts, unique values, sample data)
- Field categorization (ALWAYS_EMPTY, SINGLE_VALUE, MEANINGFUL)
- Value distributions for meaningful fields
- Export metadata (processing time, document count)
src/main/java/com/example/mongoexport/
βββ config/ # Configuration classes
β βββ FieldConfiguration.java # Individual field metadata
β βββ DiscoveryConfiguration.java # Root configuration
βββ discovery/
β βββ FieldDiscoveryService.java # Field discovery and audit logic
βββ export/
β βββ ConfigurationBasedExporter.java # Config-based export
βββ DiscoveryRunner.java # Discovery entry point
βββ ConfigExportRunner.java # Export entry point
config/ # Configuration files
βββ {collection}_fields.json # Editable field configuration
βββ {collection}_expansion_audit.txt # Visual expansion tree
output/ # Export results
βββ {collection}_ultra_comprehensive_{timestamp}.csv
- Separated discovery and export phases
- Human-editable JSON configuration
- Enhanced array field handling with reference lookups
- Improved collection caching
- Shows available fields for user configuration
Private repository - Internal use only
For issues or questions:
- Check the testing checklist above
- Review CLAUDE.md for detailed documentation
- Contact the development team