Suggestions for simple modifications
New to the system, and looking for something small and/or fairly self-contained to work on?
The following is a list of potential changes which will hopefully fit the bill. Please feel free to ask for more details.
##Code generator
-
Clean up executing the eclcc code generator regression suite on linux.
-
gh-467 A new dataset operator for extracting subsets of a stream
-
Add a user flag to keyed joins to indicate order does not need to be preserved.
-
Better BETWEEN code. Sometimes the test expression is calculated twice - e.g., EXP(SUM(values, LN(prob))) between 0.1 and 0.9
-
Optimize generation of return exists(...).
Simplest way would be to have a special target which, when you assigned to it, generated a return - but you need to be careful about assigning twice then.
-
Constant fold SORT(constant-inline-dataset)
-
Optimize comparisons of utf8/unicode against blank strings, and use the rtlCompareStrBlank() in more situations.
-
Optimize if (count(ds)>0, ds[1].field, ) to ds[1].field
-
Allow link counted child rows as well as child datasets.
-
Special case child datasets with a maxcount of 1 and just store a pointer.
-
Optimize code generated for EXISTS(JOIN(a,b,cond,all)) when evaluated inline on child datasets inside a transform.
-
Optimize COUNT(DATASET(myset, rec)) to COUNT(myset)
-
Combine aggregations. E.g.,
cnt1 := COUNT(ds(filter1));
cnt2 := COUNT(ds(filter2));
becomes
agg := TABLE(ds, {cnt1 := COUNT(GROUP, filter1); cnt2 := COUNT(GROUP, filter2); });
cnt1 := agg[1].cnt1;
cnt2 := agg[1].cnt2;
The advantage is that the counts will be done directly on the disk buffer without creating records and splitting, and it will also reduce multithreading overhead.
-
Use a thread variable for the bcd stack and remove the bcd critical block.
-
Finish support for utf8 fields.
-
Introduce a compressed archive format (with an appropriate magic header).
-
Implement an IEclSourceCollection that links to libarchive to allow building directly from compressed tar files etc.
-
Restrict hqlfold to only fold registered plugins.
-
Optimize AGGREGATE(ds, SELF.x := RIGHT.x & LEFT.x) to directly modify the target record.
-
Allow main module to be split over multiple C++ files. (Could be useful for compile times on very large queries.)
-
Cache the row allocators in the xml transformation classes. (Will require onCreate() to be added.)
-
Add a PROJECT (?) option on a JOIN to indicate it is worth pre-projecting any complex join fields. (Note projecting a guarded join condition could slow it down significantly, so may only apply to first, or need to be configurable.)
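As an illustration of the BETWEEN item in the list above, here is a minimal C++ sketch of the difference between evaluating the test expression twice and evaluating it once into a temporary. It is not the code eclcc actually generates; `expensiveTest`, `betweenNaive`, `betweenOnce` and `evalCount` are hypothetical names used only for this example, and the inclusive bounds are assumed to match ECL's BETWEEN.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical counter, present only so the number of evaluations is observable.
static int evalCount = 0;

// Stand-in for an expensive test expression such as EXP(SUM(values, LN(prob))).
double expensiveTest(const std::vector<double> & probs)
{
    ++evalCount;
    double sumLogs = 0.0;
    for (double p : probs)
        sumLogs += std::log(p);
    return std::exp(sumLogs);
}

// Naive expansion of "x BETWEEN lo AND hi": the test expression appears twice,
// so it is evaluated twice whenever the first comparison succeeds.
bool betweenNaive(const std::vector<double> & probs, double lo, double hi)
{
    return expensiveTest(probs) >= lo && expensiveTest(probs) <= hi;
}

// Improved form: evaluate once into a temporary, then compare both bounds.
bool betweenOnce(const std::vector<double> & probs, double lo, double hi)
{
    const double value = expensiveTest(probs);
    return value >= lo && value <= hi;
}
```

Calling `betweenOnce({0.5, 0.5}, 0.1, 0.9)` runs the expensive expression a single time, whereas `betweenNaive` with the same arguments runs it twice.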
More complex:
-
Better support for C++ definitions - allow dependencies on other attributes, and on other C++ files/libraries.
-
Finish work on allowing conditional statements in graphs, and enable.
-
Support an ecl through-pipe activity, with streaming input(s) and output(s?)
-
Revisit the packing option, and the option to auto-pack fields. (Always add pack in implicit project if the field order isn't fixed.)
-
Better processing of the dataset format. E.g., don't deserialize on slaves or for disk-read->output.
-
Make dataset size-field configurable.
-
Option to only use link counted child rows on datasets with elements above a certain size (e.g., sizeof(void *) bytes)?
-
Add a transform to track which sort orders/distributions are actually used, and then tag activities (e.g., keyed joins) or remove them if the sort/distribution isn't required.
-
Allow datasets and strings etc to configure whether they are prefixed with a count/length or size.
Main complication is the number of places that would need to be changed.
-
Allow length/size for a dataset/string to be separated from the data.
Would improve packing and alignment, but the representation is tricky.
-
Add an attribute to all activities (especially piperead, random, user-functions) which allows them to reference an expression that indicates the scope they should be evaluated in. It may need to be an operator (in addition?).
-
Optimize multiple aggregates on the same dataset - e.g., count(ds(a=x)), count(ds(a=y)) into a single loop when done inline.
-
Better resourcing of inline dataset operations
-
Minimize the data sent to roxie slaves (e.g., for indexread/keyed join) by generating a separate slave helper.
It would also have the benefit of making the master helper "colocal" (in the same memory space as the owner activity).
-
Implement link counted strings, and switch all temporary strings over to using them.
Will cause incompatibilities with existing plugins.
-
Implement a costing algorithm for IHqlExpressions
-
Implement some kind of MAP type, and use a hash table lookup for "x in MAP(...)"
-
Optimize order of filters. Costing is a prerequisite.
-
Use the expression costing to implicitly add ,PROJECT to a JOIN
-
Expand implicit project code to work on child records.
-
Lightweight grouped self-join which performs an all self-join on each input group. (May require minimal work in engines.)
-
Allow much more flexibility in out of line user-functions.
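The item above about optimizing multiple aggregates on the same dataset (count(ds(a=x)), count(ds(a=y))) can be sketched in C++ as follows. This is only an illustration of the single-loop idea under simplifying assumptions, not the generated code: `Row`, `countSeparately` and `countTogether` are hypothetical names, and the dataset is modelled as an in-memory vector.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record with a single field, standing in for a row of ds.
struct Row
{
    int a;
};

// One loop per aggregate: the dataset is walked twice.
std::pair<std::size_t, std::size_t> countSeparately(const std::vector<Row> & ds, int x, int y)
{
    std::size_t cntX = 0;
    for (const Row & row : ds)
        if (row.a == x)
            ++cntX;

    std::size_t cntY = 0;
    for (const Row & row : ds)
        if (row.a == y)
            ++cntY;

    return std::make_pair(cntX, cntY);
}

// Combined form: both filters are tested inside a single pass over the dataset.
std::pair<std::size_t, std::size_t> countTogether(const std::vector<Row> & ds, int x, int y)
{
    std::size_t cntX = 0;
    std::size_t cntY = 0;
    for (const Row & row : ds)
    {
        if (row.a == x)
            ++cntX;
        if (row.a == y)
            ++cntY;
    }
    return std::make_pair(cntX, cntY);
}
```

Both functions return the same pair of counts; the combined form simply avoids iterating the dataset a second time, which matters most when producing each row is itself costly.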
##PARSE
-
Allow UTF8 strings to be processed efficiently (creating DFAs etc.)
-
Revisit the tomita parser and allow it to use a more general lexer.
-
Rethink the pattern/token/rule approach of the tomita parser and allow it to be used interchangeably with the regex parser.
-
Better unicode pattern matching - look at how flex handles character classes.
-
Provide the option for using a strictly conforming xml parser for xml reads. (Using a 3rd party library).
##FileView2
-
Revisit the field mapping transformations and allow them to define an arbitrary ecl function.
-
Allow any dataset (including alien datatypes) to be displayed without generating a helper function. Probably requires an ECL interpreter. Extending the scope (to cover hqlfold expressions) would help a lot.
##Windows
- Port eclcc to windows64 (disabling boost/ssl would leave hqlfold to be implemented)
##Mac
- Finish port of eclcc to mac