DataLens

A .NET library for exploratory data analysis and statistical profiling.

Overview

DataLens answers the question: "What's in my data?" — before you clean it, before you model it.

Given a CSV or JSON dataset, DataLens produces a statistical analysis of its distributions, relationships, patterns, and anomalies. It uses FilePrepper for data ingestion and UInsight (a Rust library called over FFI) for high-performance computation.

Where DataLens Fits

CSV / JSON
  │
  ├── "Understand" → DataLens    → Analysis result + JSON
  │
  ├── "Clean"      → FilePrepper → Cleaned CSV
  │
  └── "Predict"    → MLoop       → Models, predictions
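
| Tool        | Purpose                  | Input      | Output                        |
|-------------|--------------------------|------------|-------------------------------|
| DataLens    | Understand your data     | CSV / JSON | Analysis result objects, JSON |
| FilePrepper | Clean & transform data   | CSV        | Cleaned CSV                   |
| MLoop       | Train & deploy ML models | CSV        | ML model, predictions         |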

DataLens is not a replacement for FilePrepper or MLoop — it's the first step before either of them.

Quick Start

Installation

dotnet add package DataLens

One-Line Analysis

using DataLens;

// Run the full analysis pipeline and write the result as JSON
var analysis = await DataLensEngine.Analyze("manufacturing_data.csv");
await analysis.ToJsonAsync("results.json");

HTML / chart rendering is intentionally out of scope. Pair the AnalysisResult with a renderer of your choice (Plotly.NET, ScottPlot, OxyPlot) or a future companion package such as DataLens.Reports.Plotly.

POCO Collections (no file required)

using DataLens;

// Korean property names: 주문일자 = order date, 금액 = amount, 고객명 = customer name
record Sale(DateTime 주문일자, decimal 금액, string 고객명);

var sales = new List<Sale>
{
    new(new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc), 1000m, "갑"),
    new(new DateTime(2026, 1, 2, 0, 0, 0, DateTimeKind.Utc), 2500m, "을"),
    // ...
};

var analysis = await DataLensEngine.Analyze(sales);

IEnumerable<T> is a first-class input. POCO properties are extracted via reflection ([JsonIgnore] and [DataLensIgnore] are honored). Dictionary-like inputs (IDictionary<string, object?>, ExpandoObject) are also supported — keys become column names. Custom selectors and header aliases are available via EnumerableSourceOptions<T>.

Programmatic Access

using DataLens;

var analysis = await DataLensEngine.Analyze("manufacturing_data.csv");

// Profile (row/column counts, per-column null %, type, basic stats)
Console.WriteLine($"Rows: {analysis.Profile!.RowCount}, Cols: {analysis.Profile.ColumnCount}");
foreach (var col in analysis.Profile.Columns)
{
    Console.WriteLine($"{col.Name}: type={col.DataType}, null={col.NullPercentage:F1}%");
}

// Descriptive statistics (mean, std, skew, kurtosis, ...)
foreach (var col in analysis.Descriptive!.Columns)
{
    Console.WriteLine($"{col.Name}: mean={col.Mean:F3}, skew={col.Skewness:F3}");
}

// Correlation: HighCorrelationPairs is already filtered by AnalysisOptions.CorrelationThreshold
foreach (var pair in analysis.Correlation!.HighCorrelationPairs)
{
    Console.WriteLine($"{pair.Column1} ~ {pair.Column2}: r={pair.Value:F3}");
}

// Clusters
var kmeans = analysis.Clusters!.KMeans;
if (kmeans is not null)
{
    Console.WriteLine($"K={kmeans.K}, WCSS={kmeans.Wcss:F3}");
    foreach (var cluster in kmeans.ClusterSizes)
    {
        Console.WriteLine($"  Cluster {cluster.ClusterId}: {cluster.Size} rows ({cluster.Percentage:F1}%)");
    }
}

// Outliers
Console.WriteLine($"Outliers: {analysis.Outliers!.OutlierCount} rows ({analysis.Outliers.OutlierPercentage:F1}%)");

Selecting analyses

Use AnalysisOptions to enable/disable specific analyzers:

var options = new AnalysisOptions
{
    IncludeProfiling   = true,
    IncludeDescriptive = true,
    IncludeCorrelation = true,
    IncludeClustering  = false,
    IncludeOutliers    = false,
    IncludeFeatures    = false,
    IncludePca         = false,
    IncludeChangepoints = false,
    CorrelationThreshold = 0.8
};

var analysis = await DataLensEngine.Analyze("data.csv", options);
var json = analysis.ToJson(Section.Correlation); // Single-section JSON

Analysis Modules

1. Data Profiling

Per-column overview: type detection, null counts, basic numeric summary.

var profile = await DataLensEngine.Profile("data.csv");
Console.WriteLine($"Rows: {profile.RowCount}, Columns: {profile.ColumnCount}");
foreach (var col in profile.Columns)
{
    Console.WriteLine($"{col.Name}: type={col.DataType}, null={col.NullPercentage:F1}%");
}
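
To make "null percentage" and "type detection" concrete, here is a minimal standalone sketch of how such values could be derived for one column of raw text cells. It does not call DataLens; the helper `ProfileColumn`, the rule that empty cells count as missing, and the "numeric"/"string" labels are all assumptions for illustration, not the library's actual logic.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: returns the missing-cell percentage and a coarse type label.
// Assumption: null, empty, and whitespace-only cells all count as missing.
(double NullPercentage, string DataType) ProfileColumn(string?[] cells)
{
    int missing = cells.Count(c => string.IsNullOrWhiteSpace(c));
    var present = cells.Where(c => !string.IsNullOrWhiteSpace(c)).ToArray();

    // A column is labeled numeric only if every present cell parses as a double.
    bool allNumeric = present.Length > 0 && present.All(c =>
        double.TryParse(c, NumberStyles.Float, CultureInfo.InvariantCulture, out _));

    double nullPct = cells.Length == 0 ? 0.0 : 100.0 * missing / cells.Length;
    return (nullPct, allNumeric ? "numeric" : "string");
}

var numericCol = ProfileColumn(new string?[] { "1.5", null, "2.0", "3" });
var textCol = ProfileColumn(new string?[] { "a", "b", null, "" });

Console.WriteLine($"numeric column: type={numericCol.DataType}, null={numericCol.NullPercentage:F1}%");
Console.WriteLine($"text column:    type={textCol.DataType}, null={textCol.NullPercentage:F1}%");
```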

2. Descriptive Statistics

Full numeric summary per column: count, mean, median, std, variance, Q1/Q3/IQR, skewness, kurtosis.

var analysis = await DataLensEngine.Analyze("data.csv");
foreach (var col in analysis.Descriptive!.Columns)
{
    Console.WriteLine($"{col.Name}: mean={col.Mean:F3}, std={col.Std:F3}, skew={col.Skewness:F3}");
}
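
For readers who want to sanity-check the reported numbers, the following standalone sketch computes several of the listed statistics by hand. It does not use DataLens, the sample values are made up, and the estimator conventions (population moments for skewness, excess kurtosis) are assumptions; the library may use different conventions.

```csharp
using System;
using System.Linq;

// Made-up sample values, not taken from any DataLens dataset.
double[] values = { 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 };

int n = values.Length;
double mean = values.Average();

// Central moments in population form (divide by n).
double m2 = values.Sum(v => Math.Pow(v - mean, 2)) / n;
double m3 = values.Sum(v => Math.Pow(v - mean, 3)) / n;
double m4 = values.Sum(v => Math.Pow(v - mean, 4)) / n;

double populationStd = Math.Sqrt(m2);
double sampleStd = Math.Sqrt(m2 * n / (n - 1));   // Bessel-corrected
double skewness = m3 / Math.Pow(m2, 1.5);         // assumption: g1 estimator
double excessKurtosis = m4 / (m2 * m2) - 3.0;     // assumption: excess kurtosis

// Median from a sorted copy; even-length input averages the two middle values.
double[] sorted = values.OrderBy(v => v).ToArray();
double median = n % 2 == 1
    ? sorted[n / 2]
    : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

Console.WriteLine($"mean={mean:F3}, median={median:F3}, std={sampleStd:F3}");
Console.WriteLine($"skew={skewness:F3}, excess kurtosis={excessKurtosis:F3}");
```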

3. Correlation Analysis

  • Pearson correlation matrix over numeric columns
  • High-correlation pairs auto-filtered by AnalysisOptions.CorrelationThreshold
  • Cramér's V for categorical associations

var corr = analysis.Correlation!;
foreach (var pair in corr.HighCorrelationPairs)
{
    Console.WriteLine($"{pair.Column1} ~ {pair.Column2}: r={pair.Value:F3}");
}
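
The idea behind the high-correlation filter can be sketched without DataLens: compute Pearson's r for every column pair and keep the pairs whose magnitude reaches a threshold. The column names and values are invented, and comparing |r| with `>=` is an assumption about how the threshold is applied.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Pearson correlation coefficient of two equal-length series.
double Pearson(double[] a, double[] b)
{
    double meanA = a.Average(), meanB = b.Average();
    double sab = 0, saa = 0, sbb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        double da = a[i] - meanA, db = b[i] - meanB;
        sab += da * db;
        saa += da * da;
        sbb += db * db;
    }
    return sab / Math.Sqrt(saa * sbb);
}

// Invented columns for illustration.
var columns = new (string Name, double[] Values)[]
{
    ("temp",  new[] { 1.0, 2.0, 3.0, 4.0, 5.0 }),
    ("power", new[] { 2.1, 3.9, 6.2, 8.0, 9.8 }),
    ("noise", new[] { 5.0, 1.0, 4.0, 2.0, 3.0 }),
};

const double threshold = 0.8; // same value as the CorrelationThreshold example above
var highPairs = new List<(string A, string B, double R)>();

// Visit each unordered pair once (upper triangle of the matrix).
for (int i = 0; i < columns.Length; i++)
{
    for (int j = i + 1; j < columns.Length; j++)
    {
        double r = Pearson(columns[i].Values, columns[j].Values);
        if (Math.Abs(r) >= threshold)
        {
            highPairs.Add((columns[i].Name, columns[j].Name, r));
        }
    }
}

foreach (var pair in highPairs)
{
    Console.WriteLine($"{pair.A} ~ {pair.B}: r={pair.R:F3}");
}
```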

4. Regression Analysis

Per-feature simple regression against AnalysisOptions.TargetColumn — one RegressionEntry per feature.

var options = new AnalysisOptions { TargetColumn = "S_OutputPower", IncludeRegression = true };
var analysis = await DataLensEngine.Analyze("data.csv", options);
var regression = analysis.Regression!;
foreach (var entry in regression.Entries)
{
    Console.WriteLine($"{entry.FeatureColumn}: slope={entry.Slope:F4}, R²={entry.RSquared:F4}");
}
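
As a reference for what a slope and R² mean here, this standalone sketch fits a single-feature ordinary least squares line. It does not call DataLens; the data is invented, and whether the library uses plain OLS for each `RegressionEntry` is an assumption.

```csharp
using System;
using System.Linq;

// Invented feature (x) and target (y) values.
double[] x = { 1.0, 2.0, 3.0, 4.0, 5.0 };
double[] y = { 2.1, 3.9, 6.2, 8.0, 9.8 };

double meanX = x.Average(), meanY = y.Average();

// Sums of squares and cross-products around the means.
double sxy = x.Zip(y, (a, b) => (a - meanX) * (b - meanY)).Sum();
double sxx = x.Sum(a => (a - meanX) * (a - meanX));
double syy = y.Sum(b => (b - meanY) * (b - meanY));

// Ordinary least squares for y = intercept + slope * x.
double slope = sxy / sxx;
double intercept = meanY - slope * meanX;

// R² = 1 - SS_residual / SS_total.
double ssResidual = x.Zip(y, (a, b) => Math.Pow(b - (intercept + slope * a), 2)).Sum();
double rSquared = 1.0 - ssResidual / syy;

Console.WriteLine($"slope={slope:F4}, intercept={intercept:F4}, R²={rSquared:F4}");
```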

5. Cluster Analysis

K-Means (with auto-K via Gap statistic), DBSCAN, Hierarchical, HDBSCAN.

var clusters = analysis.Clusters!;
Console.WriteLine($"Optimal K={clusters.OptimalK}");
if (clusters.KMeans is { } km)
{
    foreach (var cluster in km.ClusterSizes)
    {
        Console.WriteLine($"Cluster {cluster.ClusterId}: {cluster.Size} rows");
    }
}

6. Outlier Detection

Isolation Forest, LOF, and Mahalanobis distance.

var outliers = analysis.Outliers!;
Console.WriteLine($"Outliers: {outliers.OutlierCount} rows ({outliers.OutlierPercentage:F1}%)");
if (outliers.IsolationForest is { } iso)
{
    Console.WriteLine($"  IsolationForest: {iso.AnomalyCount} anomalies (threshold={iso.Threshold:F3})");
}
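
Of the three methods, Mahalanobis distance is the easiest to show by hand. The sketch below is a standalone two-dimensional version with invented points; it does not call DataLens, and the library's multivariate implementation and thresholding are not reproduced here.

```csharp
using System;
using System.Linq;

// Invented 2-D points; the last one sits off the trend of the others.
(double X, double Y)[] points =
{
    (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0), (3.0, 6.0),
};

int n = points.Length;
double meanX = points.Average(p => p.X);
double meanY = points.Average(p => p.Y);

// Sample covariance matrix [[sxx, sxy], [sxy, syy]].
double sxx = points.Sum(p => (p.X - meanX) * (p.X - meanX)) / (n - 1);
double syy = points.Sum(p => (p.Y - meanY) * (p.Y - meanY)) / (n - 1);
double sxy = points.Sum(p => (p.X - meanX) * (p.Y - meanY)) / (n - 1);
double det = sxx * syy - sxy * sxy;

// Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix.
double[] squaredDistances = points.Select(p =>
{
    double dx = p.X - meanX, dy = p.Y - meanY;
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det;
}).ToArray();

int mostExtreme = Array.IndexOf(squaredDistances, squaredDistances.Max());
Console.WriteLine($"Most extreme point: index {mostExtreme}, d²={squaredDistances[mostExtreme]:F3}");
```

Because the covariance is estimated from the same points, a single extreme point inflates it and partly masks itself, which is one reason to compare several detectors rather than rely on one.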

7. Feature Importance

ANOVA F-test, mutual information, and permutation importance against a target column.

var report = await DataLensEngine.FeatureImportance("data.csv", target: "Machining_Process");
foreach (var feat in report.Importance!.Scores)
{
    Console.WriteLine($"  {feat.Name}: {feat.Score:F4}");
}
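
For intuition about the ANOVA score, here is a standalone one-way ANOVA F-statistic for a single numeric feature grouped by a categorical target. The groups are invented and the helper `OneWayAnovaF` is hypothetical; it does not call DataLens.

```csharp
using System;
using System.Linq;

// One-way ANOVA F-statistic: between-group variance over within-group variance.
double OneWayAnovaF(double[][] g)
{
    int k = g.Length;                        // number of groups (target classes)
    int total = g.Sum(values => values.Length);
    double grandMean = g.SelectMany(values => values).Average();

    // Between-group sum of squares, weighted by group size.
    double ssBetween = g.Sum(values =>
        values.Length * Math.Pow(values.Average() - grandMean, 2));

    // Within-group sum of squares around each group's own mean.
    double ssWithin = g.Sum(values =>
    {
        double m = values.Average();
        return values.Sum(v => (v - m) * (v - m));
    });

    return (ssBetween / (k - 1)) / (ssWithin / (total - k));
}

// Invented feature values, one array per target class.
double[][] groups =
{
    new[] { 1.0, 2.0, 3.0 },
    new[] { 2.0, 3.0, 4.0 },
    new[] { 5.0, 6.0, 7.0 },
};

double f = OneWayAnovaF(groups);
Console.WriteLine($"F = {f:F4}");
```

A larger F means the feature's mean differs more across target classes relative to the spread inside each class.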

8. Dimensionality Reduction (PCA)

var pca = analysis.Pca!;
Console.WriteLine($"Components: {pca.NComponents}, total variance explained: {pca.TotalExplainedVariance:P1}");
for (int i = 0; i < pca.ExplainedVariance.Length; i++)
{
    Console.WriteLine($"  PC{i + 1}: {pca.ExplainedVariance[i]:P1}");
}
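
The explained-variance ratios are each principal component's eigenvalue divided by the sum of all eigenvalues of the covariance matrix. The standalone sketch below shows this for two invented features, where the eigenvalues have a closed form. It does not call DataLens, and any scaling the library applies before PCA is not modeled.

```csharp
using System;
using System.Linq;

// Two invented, strongly related features.
double[] x = { 1.0, 2.0, 3.0, 4.0, 5.0 };
double[] y = { 2.1, 3.9, 6.2, 8.0, 9.8 };

int n = x.Length;
double meanX = x.Average(), meanY = y.Average();

// Sample covariance matrix [[varX, covXY], [covXY, varY]].
double varX = x.Sum(v => (v - meanX) * (v - meanX)) / (n - 1);
double varY = y.Sum(v => (v - meanY) * (v - meanY)) / (n - 1);
double covXY = x.Zip(y, (a, b) => (a - meanX) * (b - meanY)).Sum() / (n - 1);

// Eigenvalues of a symmetric 2x2 matrix: halfTrace ± radius.
double halfTrace = (varX + varY) / 2.0;
double radius = Math.Sqrt(Math.Pow((varX - varY) / 2.0, 2) + covXY * covXY);
double[] eigenvalues = { halfTrace + radius, halfTrace - radius };

// Each ratio is the share of total variance carried by one component.
double totalVariance = eigenvalues.Sum();
double[] explained = eigenvalues.Select(l => l / totalVariance).ToArray();

for (int i = 0; i < explained.Length; i++)
{
    Console.WriteLine($"PC{i + 1}: {explained[i]:P1}");
}
```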

9. Changepoint Detection

PELT-based changepoint detection (multivariate, configurable cost function).

var options = new AnalysisOptions
{
    IncludeChangepoints = true,
    ChangepointCost = 1, // 0=L2 mean, 1=Normal mean+variance
    ChangepointMinSegmentLength = 10
};
var analysis = await DataLensEngine.Analyze("timeseries.csv", options);
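
To show what the cost function and minimum segment length control, the following standalone sketch finds a single changepoint by brute force under an L2 (mean-shift) cost. This is deliberately not PELT, which handles multiple changepoints with pruning; the helpers `L2Cost` and `BestSingleSplit` are hypothetical and the signal is invented.

```csharp
using System;

// L2 cost of the half-open segment [start, end): squared deviations from its mean.
double L2Cost(double[] series, int start, int end)
{
    double mean = 0;
    for (int i = start; i < end; i++) mean += series[i];
    mean /= (end - start);

    double cost = 0;
    for (int i = start; i < end; i++) cost += (series[i] - mean) * (series[i] - mean);
    return cost;
}

// Try every admissible split and keep the one with the lowest total cost.
// Returns -1 when the series is too short for two segments of the minimum length.
int BestSingleSplit(double[] series, int minSegmentLength)
{
    int best = -1;
    double bestCost = double.PositiveInfinity;
    for (int t = minSegmentLength; t <= series.Length - minSegmentLength; t++)
    {
        double cost = L2Cost(series, 0, t) + L2Cost(series, t, series.Length);
        if (cost < bestCost)
        {
            bestCost = cost;
            best = t;
        }
    }
    return best;
}

// Invented signal whose mean jumps from about 0 to about 5 at index 4.
double[] signal = { 0.1, -0.1, 0.0, 0.1, 5.0, 5.1, 4.9, 5.0 };
int split = BestSingleSplit(signal, minSegmentLength: 2);
Console.WriteLine($"Changepoint at index {split}");
```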

Output

DataLens emits results as JSON. Chart / HTML rendering is delegated to renderer packages (see Out of Scope).

// Full result
var json = analysis.ToJson();
await analysis.ToJsonAsync("results.json");

// Section-scoped JSON
var corrJson = analysis.ToJson(Section.Correlation);

Section members: Profile, Descriptive, Correlation, Regression, Clusters, Outliers, Distribution, Features, Pca, Changepoints.

Architecture

┌──────────────────────────────────────────┐
│           DataLens (C# .NET)             │
│                                          │
│  ┌──────────┐  ┌───────────────────────┐ │
│  │ Analysis  │  │ JSON Serializer       │ │
│  │ Pipeline  │  │  (renderer-agnostic)  │ │
│  │           │  │                       │ │
│  └─────┬────┘  └───────────┬───────────┘ │
│        │                   │             │
├────────┴───────────────────┴─────────────┤
│  ┌──────────────┐  ┌──────────────────┐  │
│  │ FilePrepper   │  │ UInsight (C#)    │  │
│  │ (C# native)   │  │ ↓ FFI            │  │
│  │               │  │ UInsight (Rust)  │  │
│  │ • CSV / JSON  │  │                  │  │
│  │ • DataFrame   │  │ • Statistics     │  │
│  │ • Type detect │  │ • Correlation    │  │
│  │               │  │ • Clustering     │  │
│  └──────────────┘  │ • PCA            │  │
│                    │ • Outlier detect │  │
│                    │ • Regression     │  │
│                    │ • Changepoints   │  │
│                    └──────────────────┘  │
└──────────────────────────────────────────┘

Integration with iyulab Tools

FilePrepper → DataLens

DataLens uses FilePrepper internally for CSV/JSON ingestion via CsvBridge. For pre-cleaning, run a FilePrepper pipeline and feed the resulting CSV to DataLens (or pass a DataFrame directly):

using FilePrepper.Pipeline;

var pipeline = await DataPipeline.FromCsvAsync("raw_data.csv");
// ... apply FilePrepper transforms ...
var df = pipeline.ToDataFrame();

var analysis = await DataLensEngine.Analyze(df);

DataLens → MLoop

DataLens analysis results can guide MLoop training decisions:

var options = new AnalysisOptions { TargetColumn = "target_column", IncludeFeatures = true };
var analysis = await DataLensEngine.Analyze("train.csv", options);

// Top features by ANOVA F-score
var topByAnova = analysis.Features!.Anova!.Features
    .OrderByDescending(f => f.FStatistic)
    .Take(15);

// High-correlation pairs (multicollinearity hints)
foreach (var pair in analysis.Correlation!.HighCorrelationPairs)
{
    Console.WriteLine($"{pair.Column1} ~ {pair.Column2}: r={pair.Value:F3}");
}

// Then proceed to MLoop with confidence:
// mloop train datasets/train.csv target_column --time 120

Scope & Non-Goals

In Scope:

  • Exploratory data analysis (EDA)
  • Statistical profiling and summaries
  • Relationship and pattern discovery (correlation, clustering, PCA)
  • Outlier and changepoint detection
  • JSON output for programmatic consumption
  • CSV / JSON ingestion via FilePrepper

Out of Scope:

  • Data cleaning / transformation (→ FilePrepper)
  • ML model training / prediction (→ MLoop)
  • Deep learning (CNN, LSTM, Autoencoder)
  • Real-time streaming analysis
  • Interactive notebook environments
  • HTML / chart rendering — pair AnalysisResult with a renderer of your choice (Plotly.NET, ScottPlot, OxyPlot, or a future companion package such as DataLens.Reports.Plotly). The core stays JSON-first.

Available:

  • Encoding auto-detection in CsvBridge (FilePrepper 0.7.0+) — new CsvLoadOptions { Encoding = "auto" } (default) detects BOM and falls back to a CP949/EUC-KR/UTF-8 heuristic. Override with explicit codepage names (e.g., "cp949", "euc-kr", "utf-8", "utf-8-bom"). CLI: pass --encoding cp949 to any command. JSON inputs are UTF-8 per RFC 8259.
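
The detection order described above (BOM first, then a heuristic) can be sketched as follows. This is a simplified standalone illustration, not FilePrepper's implementation: the helper `GuessEncoding` is hypothetical, and the fallback here is only "valid UTF-8, otherwise assume CP949", which is cruder than a real CP949/EUC-KR heuristic.

```csharp
using System;
using System.Text;

// Hypothetical helper returning one of the codepage names listed above.
string GuessEncoding(byte[] bytes)
{
    // Step 1: a UTF-8 byte order mark (EF BB BF) settles the question.
    if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
    {
        return "utf-8-bom";
    }

    // Step 2 (assumption): strict UTF-8 decoding; invalid bytes throw.
    try
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        strictUtf8.GetString(bytes);
        return "utf-8";
    }
    catch (DecoderFallbackException)
    {
        // Step 3 (assumption): anything that is not valid UTF-8 is treated as CP949.
        return "cp949";
    }
}

byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };
byte[] utf8Korean = Encoding.UTF8.GetBytes("가");
byte[] legacyKorean = { 0xB0, 0xA1 }; // not valid UTF-8: 0xB0 cannot start a sequence

Console.WriteLine(GuessEncoding(withBom));
Console.WriteLine(GuessEncoding(utf8Korean));
Console.WriteLine(GuessEncoding(legacyKorean));
```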

Every DataLens code snippet in this README is exercised by samples/DataLens.Sample, so a snippet that drifts from the actual API fails the CI build.

Requirements

License

MIT License — Built by iyulab
