CDS Extractor Rewrite Phase 2: Improve Performance and Precision #195

Status: Open. Wants to merge 40 commits into base: main.

Commits (40):
5bafe3d
Refactor CDS extractor for dedicated "cds" package
data-douser May 15, 2025
e4c1ff0
Fix CDS extractor findPackageJsonDirs
data-douser May 15, 2025
7e58207
Rename CDS extractor entrypoint and refactor args
data-douser May 16, 2025
af80a68
Add self-parser.test.ts for CDS extractor
data-douser May 16, 2025
ea19649
CDS extractor tests for compiler & packageManager
data-douser May 17, 2025
f9e41aa
Fix CDS extractor environment setup
data-douser May 18, 2025
6412400
Improve CDS extractor logging
data-douser May 18, 2025
009fe42
First attempt at project-aware CDS compilation
data-douser May 18, 2025
2c68c8a
Refactor CDS extractor for dedicated "cds" package
data-douser May 15, 2025
8e09758
Rename CDS extractor entrypoint and refactor args
data-douser May 16, 2025
1dd464b
Add self-parser.test.ts for CDS extractor
data-douser May 16, 2025
729dd2e
CDS extractor tests for compiler & packageManager
data-douser May 17, 2025
0c75133
Fix CDS extractor environment setup
data-douser May 18, 2025
09fa955
Improve CDS extractor logging
data-douser May 18, 2025
c865d94
First attempt at project-aware CDS compilation
data-douser May 18, 2025
d629d1e
Merge branch 'data-douser/cds-ts-rewrite-2' of github.com:data-douser…
data-douser May 18, 2025
5aa2d54
Update node dependencies for CDS extractor
data-douser May 18, 2025
86b5572
Update CDS extractor flowchart diagram
data-douser May 18, 2025
d6a99da
Merge branch 'main' into data-douser/cds-ts-rewrite-2
data-douser Jun 8, 2025
bc82815
Fixes CDS extractor project-aware file detection
data-douser Jun 9, 2025
6315a49
Remove "--parse" from CDS compile command
data-douser Jun 10, 2025
bc4a2cd
Merge branch 'advanced-security:main' into data-douser/cds-ts-rewrite-2
data-douser Jun 10, 2025
27743ba
Simplify CDS extractor logic and refactor
data-douser Jun 10, 2025
7a05f80
Fix project-aware CDS compile file paths
data-douser Jun 11, 2025
af066d7
Merge branch 'advanced-security:main' into data-douser/cds-ts-rewrite-2
data-douser Jun 11, 2025
0ba67d5
Fix code-scanning alerts for insecure tmp files
data-douser Jun 11, 2025
0f1ac9e
Improve testing of CDS extractor graph
data-douser Jun 11, 2025
aaab73b
Update CDS extractor node dependencies
data-douser Jun 11, 2025
e4bbc1f
Implement cdsExtractorLog for consistent logging
data-douser Jun 11, 2025
99923e0
Common project graph for CDS parse and compile
data-douser Jun 25, 2025
efc989f
Fix CdlService getImplementation file location
data-douser Jun 25, 2025
9dfbe3b
Use shell-quote.quote in testCdsCommand()
data-douser Jun 25, 2025
fcac0f1
Fix unit test for CDS extractor
data-douser Jun 25, 2025
3573995
Fix CDS extractor args validation
data-douser Jun 26, 2025
1c5d011
Improve cdsExtractorLog for debugging performance
data-douser Jun 26, 2025
359290e
Replace Math.random() in CDS extractor
data-douser Jun 26, 2025
d4dd37a
Refactor CDS extractor run modes
data-douser Jun 27, 2025
90453cc
Fix CDS extractor monorepo support
data-douser Jun 27, 2025
aa7dac4
Replace CDS extractor autobuild.md with README.md
data-douser Jun 27, 2025
cc29a54
Add CDS extractor JS dist and node_modules
data-douser Jun 27, 2025
The diff you're trying to view is too large. We only load the first 3000 changed files.
2 changes: 2 additions & 0 deletions .gitignore
@@ -71,3 +71,5 @@ tmp/
**.testproj
dbs
*.cds.json
.cds-extractor-cache

50 changes: 32 additions & 18 deletions extractors/README.md
@@ -30,16 +30,20 @@ pre-finalize.sh`"]
JSE[[javascript extractor]]
DTRAC[codeql database<br>trace-command]
SPF[[pre-finalize.sh]]
DIDX[codeql database index-files<br> --language=cds<br>--include-extension=.cds]
SIF[[index-files.sh]]
SIT[[index-files.ts/js]]
NPM[[npm install & build]]
DETS[[Determine CDS command]]
FIND[[Find package.json dirs]]
INST[[Install dependencies]]
CC[[cds compiler]]
ABCMD[[autobuild.sh/cmd]]
ABT[[cds-extractor.ts/js]]
ENV[[setup & validate<br>environment]]
PDG[[build project<br>dependency graph]]
INSTC[[install dependencies<br>with caching]]
PROC[[process CDS files<br>to JSON]]
PMAP[[project-aware<br>dependency resolution]]
FIND[[find project for<br>CDS file]]
CDCMD[[determine CDS<br>command for project]]
COMP[[compile CDS<br>to JSON]]
CDJ([.cds.json files])
FILT[[configure LGTM<br>index filters]]
JSA[[javascript extractor<br>autobuild script]]
DIAG[[add compilation<br>diagnostics]]
TF([CodeQL TRAP files])
DBF[codeql database finalize<br> -- /path/to/database]

@@ -54,20 +58,30 @@ pre-finalize.sh`"]
JSE ==> |run autobuild within<br>the javascript extractor| DTRAC

DTRAC ==> |run the build --command| SPF
SPF ==> |run codeql index-files<br>for CDS files| DIDX
DIDX ==> |invoke script via<br>--search-path| SIF
SIF ==> |runs TypeScript version<br>after npm install| NPM
NPM ==> |executes compiled<br>index-files.js| SIT
SPF ==> |run autobuilder<br>for CDS files| ABCMD
ABCMD ==> |runs TypeScript version<br>of CDS extractor| ABT

SIT ==> |finds project directories<br>with package.json| FIND
FIND ==> |install CDS dependencies<br>in project directories| INST
SIT ==> |determines which<br>cds command to use| DETS
DETS ==> |processes each CDS file| CC
ABT ==> |setup and validate<br>environment first| ENV
ABT ==> |build project dependency<br>graph for source root| PDG
PDG ==> |analyze CDS projects<br>structure & relationships| PMAP

ABT ==> |efficiently install<br>required dependencies| INSTC
INSTC ==> |use cached approach for<br>dependency installation| PMAP

ABT ==> |process each CDS file<br>to generate JSON files| PROC
PROC ==> |find which project<br>contains this CDS file| FIND
FIND ==> |uses project-aware<br>dependency resolution| PMAP
FIND ==> |determine appropriate<br>CDS command for project| CDCMD

CDCMD ==> |compile CDS file to JSON<br>with project context| COMP
COMP ==> |generate JSON representation<br>with project awareness| CDJ
COMP --x |if compilation fails,<br>report diagnostics| DIAG
DIAG -.-> |diagnostics stored<br>in database| DB

CC ==> |compile .cds files to<br>create .cds.json files| CDJ
CDJ -.-> |stored in same location<br>as original .cds files| DB

SIT ==> |configures extraction<br>filters for JSON files| JSA
ABT ==> |configure extraction<br>filters for JSON files| FILT
ABT ==> |run JavaScript extractor<br>to process JSON files| JSA
JSA ==> |processes .cds.json files<br>via javascript extractor| CDJ

CDJ ==> |javascript extractor<br>generates TRAP files| TF
17 changes: 7 additions & 10 deletions extractors/cds/tools/.gitignore
@@ -1,12 +1,9 @@
# Ignore the entire "out" directory as this is for the .js and .js.map files
# which are generated by the `tsc` build process. In the current project config,
# we require the platform-specific "index-files" shell/cmd script to run the
# `npm run build` command that generates the files for the correct platform and
# local environment.
out/
# Ignore files created just for debugging the CDS extractor.
debug/

# Since we expect the build process to be run on the system where the CDS extractor
# is being run, we do not need/want to check-in our own package-lock.json version
# when we know it will be different / overwritten on each system.
package-lock.json
# Override the repository-level .gitignore to explicitly include the "dist" and
# "node_modules" sub-directories, which are used to store the CDS extractor JS build
# files and their Node.js dependencies, respectively.
!dist/
!node_modules/

221 changes: 221 additions & 0 deletions extractors/cds/tools/README.md
@@ -0,0 +1,221 @@
# CodeQL CDS Extractor

A CodeQL extractor for [Core Data Services (CDS)][CDS] files used in [SAP Cloud Application Programming Model (CAP)][CAP] projects. The extractor compiles `.cds` files into `.cds.json` files for CodeQL analysis, using project-aware parsing and dependency resolution.

## Overview

The CodeQL CDS extractor is designed to efficiently process CDS projects by:

- **Project-Aware Processing**: Analyzes CDS files in the context of the project they belong to rather than as independent definitions
- **Optimized Dependency Management**: Caches and reuses `@sap/cds` and `@sap/cds-dk` dependencies across projects
- **Enhanced Precision**: Reduces false positives in CodeQL queries by understanding cross-file relationships
- **Performance Optimization**: Avoids duplicate processing and unnecessary dependency installations

## Architecture

The extractor uses an `autobuild` approach with the following key components:

### Core Components

- **`cds-extractor.ts`**: Main entry point that orchestrates the extraction process
- **`src/cds/parser/`**: CDS project discovery and dependency graph building
- **`src/cds/compiler/`**: Compilation orchestration and `.cds.json` generation
- **`src/packageManager/`**: Dependency installation and caching
- **`src/logging/`**: Unified logging and performance tracking
- **`src/environment.ts`**: Environment setup and validation
- **`src/codeql.ts`**: CodeQL JavaScript extractor integration

### Extraction Process

1. **Environment Setup**: Validates CodeQL tools and system requirements
2. **Project Discovery**: Recursively scans for CDS projects and builds dependency graph
3. **Dependency Management**: Installs and caches required CDS compiler dependencies
4. **CDS Compilation**: Compiles `.cds` files to `.cds.json` using project-aware compilation
5. **JavaScript Extraction**: Runs CodeQL's JavaScript extractor on source and compiled files
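
The five steps above run in sequence, each building on the previous one. The following is a minimal sketch of that ordering, not the actual `cds-extractor.ts` code. The `Phase` type and the `runPhases` helper are hypothetical, and stopping at the first failed phase is an assumption; the real extractor may recover differently.

```typescript
// Hypothetical phase names mirroring the five steps listed above.
type PhaseName =
  | "environment-setup"
  | "project-discovery"
  | "dependency-management"
  | "cds-compilation"
  | "javascript-extraction";

// Hypothetical shape: each phase reports success (true) or failure (false).
interface Phase {
  name: PhaseName;
  run: () => boolean;
}

// Runs phases in order and returns the names of the phases that succeeded.
// Assumption: a failed phase stops the pipeline, since later phases depend
// on the results of earlier ones.
function runPhases(phases: Phase[]): PhaseName[] {
  const completed: PhaseName[] = [];
  for (const phase of phases) {
    if (!phase.run()) {
      break;
    }
    completed.push(phase.name);
  }
  return completed;
}
```

Modelling each step as a value keeps the ordering explicit and makes it easy to substitute stub phases in tests.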

## Usage

### Prerequisites

- Node.js (accessible via `node` command)
- CodeQL CLI tools
- SAP CDS projects with `.cds` files

### Running the Extractor

The extractor is typically invoked by CodeQL during database creation:

```bash
codeql database create --language=cds --source-root=/path/to/project my-database
```

### Manual Execution

For development and testing purposes:

```bash
# Build the extractor
npm run build

# Run directly (from project source root)
node dist/cds-extractor.js /path/to/source/root
```

## Development

### Project Structure

```text
extractors/cds/tools/
├── cds-extractor.ts # Main entry point
├── src/ # Source code modules
│ ├── cds/ # CDS-specific functionality
│ │ ├── compiler/ # Compilation orchestration
│ │ └── parser/ # Project discovery and parsing
│ ├── logging/ # Logging and performance tracking
│ ├── packageManager/ # Dependency management
│ ├── codeql.ts # CodeQL integration
│ ├── diagnostics.ts # Error reporting
│ ├── environment.ts # Environment setup
│ ├── filesystem.ts # File system utilities
│ └── utils.ts # General utilities
├── test/ # Test suites
├── dist/ # Compiled JavaScript output
└── package.json # Project configuration
```

### Building

```bash
# Install dependencies
npm install

# Build TypeScript to JavaScript
npm run build

# Run all checks and build
npm run build:all
```

### Testing

```bash
# Run tests
npm test

# Run tests with coverage
npm run test:coverage

# Run tests in watch mode
npm run test:watch
```

### Code Quality

```bash
# Lint TypeScript files
npm run lint

# Auto-fix linting issues
npm run lint:fix

# Format code
npm run format
```

## Configuration

### Environment Variables

The extractor respects several CodeQL environment variables:

- `CODEQL_DIST`: Path to CodeQL distribution
- `CODEQL_EXTRACTOR_CDS_WIP_DATABASE`: Target database path
- `LGTM_INDEX_FILTERS`: File filtering configuration
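
As an illustration of how these variables might be consumed, the sketch below collects them into one object. The `ExtractorEnv` interface and `readExtractorEnv` function are hypothetical names, and the sketch assumes that `LGTM_INDEX_FILTERS` holds one filter per line.

```typescript
// Hypothetical container for the three variables listed above.
interface ExtractorEnv {
  codeqlDist: string | undefined;
  wipDatabase: string | undefined;
  indexFilters: string[];
}

// Reads the variables from an environment-like object (pass `process.env`
// when running under Node.js).
// Assumption: LGTM_INDEX_FILTERS holds one filter per line.
function readExtractorEnv(env: Record<string, string | undefined>): ExtractorEnv {
  const rawFilters = env["LGTM_INDEX_FILTERS"] ?? "";
  return {
    codeqlDist: env["CODEQL_DIST"],
    wipDatabase: env["CODEQL_EXTRACTOR_CDS_WIP_DATABASE"],
    indexFilters: rawFilters
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0),
  };
}
```

Taking the environment as a parameter, rather than reading `process.env` directly, keeps the function easy to test with fixed values.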

### CDS Project Detection

Projects are detected based on:

- Presence of `package.json` files
- CDS files (`.cds`) in the project directory tree
- Valid CDS dependencies (`@sap/cds`, `@sap/cds-dk`) in package.json
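
The criteria above can be sketched as a pure check over an already-parsed `package.json` and a list of file paths. The function names are hypothetical, and requiring all three criteria at once is an assumption; the actual parser may weigh these signals differently.

```typescript
// Minimal view of the package.json fields this sketch inspects.
interface PackageJsonLike {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

const CDS_PACKAGES = ["@sap/cds", "@sap/cds-dk"];

// True when package.json declares at least one of the CDS packages.
function declaresCdsDependency(pkg: PackageJsonLike): boolean {
  const declared = { ...pkg.dependencies, ...pkg.devDependencies };
  return CDS_PACKAGES.some((name) => name in declared);
}

// Assumption: a directory counts as a CDS project only when all three
// criteria hold. `pkg` is undefined when the directory has no package.json;
// `files` lists paths found beneath the directory.
function looksLikeCdsProject(pkg: PackageJsonLike | undefined, files: string[]): boolean {
  if (pkg === undefined) {
    return false;
  }
  const hasCdsFiles = files.some((file) => file.endsWith(".cds"));
  return hasCdsFiles && declaresCdsDependency(pkg);
}
```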

### Compilation Strategy

The extractor's compilation approach has four parts:

1. **Dependency Graph Building**: Maps relationships between CDS projects
2. **Smart Caching**: Reuses compiled outputs and dependency installations
3. **Error Recovery**: Handles compilation failures gracefully
4. **Performance Tracking**: Monitors compilation times and resource usage
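
To illustrate the error-recovery point, here is a minimal sketch in which one failing file does not stop the remaining files from being compiled. The `CompileOutcome` type and `compileAll` function are hypothetical, and the `compile` callback stands in for whatever actually invokes the CDS compiler.

```typescript
// Outcome of compiling one CDS file.
interface CompileOutcome {
  file: string;
  ok: boolean;
  message?: string;
}

// Compiles every file with the supplied `compile` callback (a stand-in for
// the real compiler invocation). A failure is recorded as a diagnostic-style
// outcome and processing moves on to the next file.
function compileAll(files: string[], compile: (file: string) => void): CompileOutcome[] {
  return files.map((file) => {
    try {
      compile(file);
      return { file, ok: true };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { file, ok: false, message };
    }
  });
}
```

Collecting outcomes instead of throwing lets the caller decide how failures are reported, for example as diagnostics attached to the database.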

## Performance Features

### Optimized Dependency Management

- **Shared Dependency Cache**: Single installation per unique dependency combination
- **Isolated Environments**: Dependencies installed in temporary cache directories
- **No Source Modification**: Original project files remain unchanged
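
One way to get a single installation per unique dependency combination is to derive a key from the resolved versions and group projects by it. The sketch below assumes a hypothetical `ProjectDeps` summary and key format; the real cache layout is not described here.

```typescript
// Hypothetical summary of a project's CDS dependency versions.
interface ProjectDeps {
  dir: string;
  cdsVersion: string;
  cdsDkVersion: string;
}

// Builds a key that is identical for projects with the same version pair.
function dependencyKey(project: ProjectDeps): string {
  return `@sap/cds@${project.cdsVersion}|@sap/cds-dk@${project.cdsDkVersion}`;
}

// Groups project directories by dependency combination; each group would
// need only one installation.
function groupByDependencyKey(projects: ProjectDeps[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const project of projects) {
    const key = dependencyKey(project);
    const dirs = groups.get(key) ?? [];
    dirs.push(project.dir);
    groups.set(key, dirs);
  }
  return groups;
}
```

With this grouping, the number of installations equals the number of distinct keys rather than the number of projects.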

### Efficient Processing

- **Project-Level Compilation**: Compiles related CDS files together
- **Duplicate Avoidance**: Prevents redundant processing of imported files
- **Memory Tracking**: Monitors and reports memory usage throughout extraction
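
One possible way to avoid reprocessing imported files is to compile only the files that no other file imports, since compiling a root already pulls in its imports. The sketch below assumes a hypothetical import map as input and is not taken from the extractor's source.

```typescript
// `imports` maps each CDS file to the files it imports (hypothetical input;
// the real extractor would derive this from its project graph).
function filesToCompile(imports: Record<string, string[]>): string[] {
  const imported = new Set<string>();
  for (const file of Object.keys(imports)) {
    for (const target of imports[file]) {
      imported.add(target);
    }
  }
  // Keep only files that nothing else imports. Files caught in an import
  // cycle have no root and would need separate handling.
  return Object.keys(imports)
    .filter((file) => !imported.has(file))
    .sort();
}
```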

### Scalability

- **Large Codebase Support**: Optimized for enterprise-scale CDS projects
- **Parallel Processing**: Where possible, processes independent projects concurrently
- **Resource Management**: Cleans up temporary files and cached dependencies

## Integration with CodeQL

### File Processing

The extractor processes both:

- **Source Files**: Original `.cds` files for source code analysis
- **Compiled Files**: Generated `.cds.json` files for semantic analysis

### Database Population

- Integrates with CodeQL's JavaScript extractor for final database population
- Maintains proper file relationships and source locations
- Supports CodeQL's standard indexing and filtering mechanisms

## Troubleshooting

### Common Issues

1. **Missing Node.js**: Ensure `node` command is available in PATH
2. **CDS Dependencies**: Verify projects have valid `@sap/cds` dependencies
3. **Compilation Failures**: Check CDS syntax and cross-file references
4. **Memory Issues**: Monitor memory usage for very large projects

### Debugging

The extractor provides comprehensive logging:

- **Performance Tracking**: Times for each extraction phase
- **Memory Usage**: Memory consumption at key milestones
- **Error Reporting**: Detailed error messages with context
- **Project Discovery**: Information about detected CDS projects

### Log Levels

- `info`: General progress and milestone information
- `warn`: Non-critical issues that don't prevent extraction
- `error`: Critical failures that may affect extraction quality
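
A minimal sketch of a logger built around these three levels follows. The `formatLogLine` and `logMessage` names and the line layout are illustrative assumptions and do not reflect the signature of the extractor's own logging module.

```typescript
type LogLevel = "info" | "warn" | "error";

// Formats one log line; the layout is illustrative only.
function formatLogLine(level: LogLevel, message: string): string {
  return `${level.toUpperCase()}: ${message}`;
}

// Routes a message to the console method that matches its level.
function logMessage(level: LogLevel, message: string): void {
  const line = formatLogLine(level, message);
  if (level === "error") {
    console.error(line);
  } else if (level === "warn") {
    console.warn(line);
  } else {
    console.log(line);
  }
}
```

Restricting `LogLevel` to a union of string literals lets the compiler reject unknown levels at the call site.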

## References

- [SAP Cloud Application Programming Model][CAP]
- [Core Data Services (CDS)][CDS]
- [Conceptual Definition Language (CDL)][CDL]
- [CodeQL Documentation](https://codeql.github.com/docs/)

[CAP]: https://cap.cloud.sap/docs/about/
[CDS]: https://cap.cloud.sap/docs/cds/
[CDL]: https://cap.cloud.sap/docs/cds/cdl