Skip to content

Conversation

@stress-tess
Copy link
Member

@stress-tess stress-tess commented Jul 1, 2021

This pull request updates PR #627 submitted by @zhihuidu.

This PR:

  • Brings string-suffix-array-functionality up to date with master
  • Rearranges code to increase separation between strings and suffix arrays and minimize diff with master/highlight changes
  • Removes external libdivsufsort in favor of zhihuidu's native chapel suffix array construction implementation
  • Changes formatting in some areas

bf1757c Repackages zhihuidu's commits in PR#627
  • Merge squashed commits and rebased over a fresh copy of master. This makes the commits more portable and the git history of master more readable with all commits grouped together
  • Helps minimize diff with master since I default to master in conflicts where there are only formatting differences
07f51d8 Applies original minimum changes to resolve conflicts with master
ca4941b Implements changes suggested in PR 627 feedback
  • Minimizes diff with master to highlight functionality changes
  • Moves SArrarys class to suffix_array.py
  • Refomats suffix_array.py
  • Reformats SACA.chpl to fix bracket alignment
  • Removes commented out code
  • Adds type hints to size and bytes in strings.py for mypy
  • Replaces writelns in SegmentedArray.chpl with logging statements following the example of SegString class

Previously stand-alone commits have been squashed into this one since Zhihui approved them

  • Removes external libdivsufsort code and references. Defaults to zhihuidu's native Chapel suffix array construction implementation using skew algorithm (based on Simple Linear Work Suffix Array Construction)
  • Using the native chapel suffix array construction implementation prevents copying the libdivsufsort codebase into arkouda or forcing the dependency on arkouda users.
  • Renames class SegSArray to class SegSuffixArray and moves this class to SegmentedSuffixArray.chpl (previously in SegmentedArray.chpl along with SegString). Rearranged to increase separation with strings and minimize diff with master.
935f29b Implements changes suggested in this PR's feedback
  • Removes unused fields variable in _parse_single_int_array_value
  • Passes Exception into ValueError in _parse_single_int_array_value
  • Inverts logic of in1d_int to avoid duplicating the ind1d logic
  • Adds self.name to SArray class and updates attach/register functions to use dots instead of dashs to be consistent with the Strings class
  • Removes unnecessary cast in suffix_array_file
  • Changes typing from sipHash128 and computeSipHashLocalized to accept int and uint(8)
  • Updates suffix_array registration and info functions
  • Renames SACA.chpl to SuffixArrayConstruction.chpl
  • Adds UnitTestSuffixArrayConstruction.chpl
  • Adds docstrings to SuffixArrayConstruction.chpl and removes duplicated logic by modifying radixPass to accept int or uint(8)

@stress-tess stress-tess requested a review from mhmerrill July 1, 2021 17:06
@stress-tess
Copy link
Member Author

stress-tess commented Jul 1, 2021

@zhihuidu, are you okay with the changes I've made here? I specifically made the move/rename class SegSArray and remove libdivsufsort commits stand alone, so they are easy to revert

Once I have your approval, I'll invite the other core team members to review

Edit: I've heard back from zhihui and he is happy with this restructuring

stress-tess pushed a commit to stress-tess/arkouda that referenced this pull request Jul 14, 2021
- Removes unused `fields` variable in _parse_single_int_array_value
- Passes Exception into ValueError in _parse_single_int_array_value
- Inverts logic of `in1d_int` to avoid duplicating the `ind1d` logic
- Adds `self.name` to `SArray` class and updates attach/register functions to use dots instead of dashs to be consistent with` the `Strings` class
- Removes unnecessary cast in `suffix_array_file`
- Changes typing from `sipHash128` and `computeSipHashLocalized` to accept int and uint(8)
- Updates suffix_array registration and info functions
- Renames SACA.chpl to SuffixArrayConstruction.chpl
- Adds UnitTestSuffixArrayConstruction.chpl
stress-tess pushed a commit to stress-tess/arkouda that referenced this pull request Jul 14, 2021
- Removes unused `fields` variable in _parse_single_int_array_value
- Passes Exception into ValueError in _parse_single_int_array_value
- Inverts logic of `in1d_int` to avoid duplicating the `ind1d` logic
- Adds `self.name` to `SArray` class and updates attach/register functions to use dots instead of dashs to be consistent with` the `Strings` class
- Removes unnecessary cast in `suffix_array_file`
- Changes typing from `sipHash128` and `computeSipHashLocalized` to accept int and uint(8)
- Updates suffix_array registration and info functions
- Renames SACA.chpl to SuffixArrayConstruction.chpl
- Adds UnitTestSuffixArrayConstruction.chpl
stress-tess pushed a commit to stress-tess/arkouda that referenced this pull request Jul 14, 2021
- Removes unused `fields` variable in _parse_single_int_array_value
- Passes Exception into ValueError in _parse_single_int_array_value
- Inverts logic of `in1d_int` to avoid duplicating the `ind1d` logic
- Adds `self.name` to `SArray` class and updates attach/register functions to use dots instead of dashs to be consistent with` the `Strings` class
- Removes unnecessary cast in `suffix_array_file`
- Changes typing from `sipHash128` and `computeSipHashLocalized` to accept int and uint(8)
- Updates suffix_array registration and info functions
- Renames SACA.chpl to SuffixArrayConstruction.chpl
- Adds UnitTestSuffixArrayConstruction.chpl
- Adds docstrings to SuffixArrayConstruction.chpl and removes duplicated logic by modifying `radixPass` to accept int or uint(8)
stress-tess pushed a commit to stress-tess/arkouda that referenced this pull request Jul 16, 2021
- Removes unused `fields` variable in _parse_single_int_array_value
- Passes Exception into ValueError in _parse_single_int_array_value
- Inverts logic of `in1d_int` to avoid duplicating the `ind1d` logic
- Adds `self.name` to `SArray` class and updates attach/register functions to use dots instead of dashs to be consistent with` the `Strings` class
- Removes unnecessary cast in `suffix_array_file`
- Changes typing from `sipHash128` and `computeSipHashLocalized` to accept int and uint(8)
- Updates suffix_array registration and info functions
- Renames SACA.chpl to SuffixArrayConstruction.chpl
- Adds UnitTestSuffixArrayConstruction.chpl
- Adds docstrings to SuffixArrayConstruction.chpl and removes duplicated logic by modifying `radixPass` to accept int or uint(8)
stress-tess pushed a commit to stress-tess/arkouda that referenced this pull request Jul 16, 2021
- Removes unused `fields` variable in _parse_single_int_array_value
- Passes Exception into ValueError in _parse_single_int_array_value
- Inverts logic of `in1d_int` to avoid duplicating the `ind1d` logic
- Adds `self.name` to `SArray` class and updates attach/register functions to use dots instead of dashs to be consistent with` the `Strings` class
- Removes unnecessary cast in `suffix_array_file`
- Changes typing from `sipHash128` and `computeSipHashLocalized` to accept int and uint(8)
- Updates suffix_array registration and info functions
- Renames SACA.chpl to SuffixArrayConstruction.chpl
- Adds UnitTestSuffixArrayConstruction.chpl
- Adds docstrings to SuffixArrayConstruction.chpl and removes duplicated logic by modifying `radixPass` to accept int or uint(8)
stress-tess pushed a commit to stress-tess/arkouda that referenced this pull request Jul 20, 2021
- Removes unused `fields` variable in _parse_single_int_array_value
- Passes Exception into ValueError in _parse_single_int_array_value
- Inverts logic of `in1d_int` to avoid duplicating the `ind1d` logic
- Adds `self.name` to `SArray` class and updates attach/register functions to use dots instead of dashs to be consistent with` the `Strings` class
- Removes unnecessary cast in `suffix_array_file`
- Changes typing from `sipHash128` and `computeSipHashLocalized` to accept int and uint(8)
- Updates suffix_array registration and info functions
- Renames SACA.chpl to SuffixArrayConstruction.chpl
- Adds UnitTestSuffixArrayConstruction.chpl
- Adds docstrings to SuffixArrayConstruction.chpl and removes duplicated logic by modifying `radixPass` to accept int or uint(8)
- Updates SegmentedSuffixArray to use ServerErrors per PR#882
stress-tess pushed a commit to stress-tess/arkouda that referenced this pull request Aug 5, 2021
- Removes unused `fields` variable in _parse_single_int_array_value
- Passes Exception into ValueError in _parse_single_int_array_value
- Inverts logic of `in1d_int` to avoid duplicating the `ind1d` logic
- Adds `self.name` to `SArray` class and updates attach/register functions to use dots instead of dashs to be consistent with` the `Strings` class
- Removes unnecessary cast in `suffix_array_file`
- Changes typing from `sipHash128` and `computeSipHashLocalized` to accept int and uint(8)
- Updates suffix_array registration and info functions
- Renames SACA.chpl to SuffixArrayConstruction.chpl
- Adds UnitTestSuffixArrayConstruction.chpl
- Adds docstrings to SuffixArrayConstruction.chpl and removes duplicated logic by modifying `radixPass` to accept int or uint(8)
- Updates SegmentedSuffixArray to use ServerErrors per PR#882
Pierce Hayes added 4 commits November 22, 2021 14:23
commit 82776f9
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Fri Mar 19 18:19:08 2021 -0400

    solve the inconsistency in dtype

commit 87c6327
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Mar 15 21:33:30 2021 -0400

    merge with the latest version

commit 924ac94
Merge: 18f481c 678097f
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Mar 15 20:33:30 2021 -0400

    Merge branch 'master' of github.com:mhmerrill/arkouda

commit 18f481c
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Mar 15 18:08:01 2021 -0400

     handle some bytes to string

commit fb89d88
Merge: c002419 6d65335
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Mar 15 17:05:35 2021 -0400

     solve conflict of suffix array, SegmentedMsg, SegmentedArray and run_benchmarks

commit 6d65335
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Jan 10 22:18:31 2021 -0500

    tuple data type

commit f5ca67a
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Jan 10 20:44:09 2021 -0500

    remove mypy CI check error

commit 010a446
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Jan 10 19:35:09 2021 -0500

    return the string for suffix_array_file

commit 80f78c6
Merge: 00b3579 e192878
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Fri Jan 8 16:08:57 2021 -0500

    resolve convlict

commit 00b3579
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Fri Jan 8 15:00:43 2021 -0500

    single locales for C code

commit c782b5a
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Fri Jan 8 13:19:21 2021 -0500

    add switch betwteen different SA algorithms

commit 2d40c0e
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Tue Jan 5 14:14:36 2021 -0500

    solve the sphinx error

commit 2af5ce8
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Tue Jan 5 09:57:42 2021 -0500

    check the comments to remove docs CI check error

commit 0bff3e4
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Jan 4 18:45:28 2021 -0500

    remove the enhenced attribute in sym table

commit a8a195a
Merge: 221679e f1781e8
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Jan 4 17:40:48 2021 -0500

    Merge pull request Bears-R-Us#4 from reuster986/master

    Fixed bug in UnitTestPeelStick

commit 221679e
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Jan 4 17:23:31 2021 -0500

    update lcp related code
n
commit f1781e8
Author: Bill Reus <reuster986@gmail.com>
Date:   Mon Jan 4 16:18:51 2021 -0500

    Fixed bug in UnitTestPeelStick

commit 50be2e3
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 27 22:25:52 2020 -0500

    copy string_test.py

commit 6fd3b05
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 27 18:31:09 2020 -0500

    change suffix array return as an int array

commit ac7e209
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 27 16:40:20 2020 -0500

    add corectness check in sa.py

commit 9f5c3d3
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sat Dec 26 10:28:59 2020 -0500

    check test/*.chpl

commit ed98498
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sat Dec 26 01:47:35 2020 -0500

    make sa.py check easy

commit 0aa835c
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sat Dec 26 01:33:12 2020 -0500

    copy master gather.py

commit 189c32e
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sat Dec 26 01:15:26 2020 -0500

    add an empty correctness function

commit 267238c
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sat Dec 26 00:55:43 2020 -0500

    import SArrays class in pdarraysetops.py

commit 4f773e2
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sat Dec 26 00:38:10 2020 -0500

    remove the bug causing wrong return string value

commit 6074b60
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Thu Dec 24 00:20:55 2020 -0500

    remove binary op

commit edc3f63
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Thu Dec 24 00:16:22 2020 -0500

    remove binary op

commit 3861e62
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Thu Dec 24 00:07:05 2020 -0500

    bool or pdarray

commit e1c3173
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Thu Dec 24 00:01:20 2020 -0500

    data type

commit 0cda91d
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Wed Dec 23 23:31:10 2020 -0500

    type match

commit 59174ac
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Wed Dec 23 22:47:58 2020 -0500

    correct a typo

commit 9d60563
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Wed Dec 23 22:20:13 2020 -0500

    align with strings function

commit 16cca77
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Wed Dec 23 21:49:28 2020 -0500

    remove unused import

commit a6c536e
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Tue Dec 22 11:18:51 2020 -0500

    update the SegSArray

commit 32ecc0c
Merge: 38818a3 18e8acf
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Tue Dec 22 10:59:10 2020 -0500

    solve the conflict

commit 38818a3
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Dec 21 09:44:13 2020 -0500

    follow suggestions from community

commit 6b6e41a
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 20 18:08:04 2020 -0500

    update third party config

commit da30cfa
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 20 17:15:09 2020 -0500

    remove suffixarray_test.py

commit 2db17d8
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 20 15:47:18 2020 -0500

    datatype in string.py

commit e90ca27
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 20 14:56:33 2020 -0500

    remove tab in MultiTypeSymEntry.chpl

commit 7a0b197
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 20 14:44:20 2020 -0500

    suffix_arry_file updated

commit 9702d46
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 20 13:57:26 2020 -0500

    include sa.py into run_benchmarks.py

commit 42e3ba7
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Thu Dec 17 18:13:07 2020 -0500

    change to relative directory

commit b6228e5
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Dec 14 23:17:34 2020 -0500

    remove tab, remove unused codes

commit 8f88f4c
Merge: c9e03fb d184048
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Dec 14 14:28:09 2020 -0500

    Merge pull request Bears-R-Us#3 from alvaradoo/patch-1

    Update SACA.chpl comments

commit d184048
Author: Oliver Alvarado Rodriguez <41132909+alvaradoo@users.noreply.github.com>
Date:   Mon Dec 14 13:39:23 2020 -0500

    Update SACA.chpl comments

    reworded and added some comments

commit c9e03fb
Author: David Bader <dbader13@gmail.com>
Date:   Sun Dec 13 22:21:31 2020 -0500

    updated

commit 4131820
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 13 21:56:00 2020 -0500

    add thirdparty files

commit 3a220dc
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 13 21:29:48 2020 -0500

    solve conflict

commit 9f22f79
Merge: e517e36 940cb79
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Sun Dec 13 20:16:35 2020 -0500

    add thirdpary files

commit 940cb79
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Wed Dec 9 15:01:34 2020 -0500

    confirm submit all changes

commit 21df359
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Tue Dec 8 14:19:37 2020 -0500

    add the lcp array method

commit dbd6d96
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Mon Dec 7 19:35:43 2020 -0500

    add Chapel skew suffix array algorithm

commit c81d755
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Thu Nov 26 17:47:45 2020 -0500

    add suffix_array Python test

commit 9a22704
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Wed Nov 25 20:18:20 2020 -0500

    change name

commit d288c10
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Tue Nov 24 23:37:55 2020 -0500

    add read file suffix array function and all libdivsufsort files

commit 7af0b51
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Thu Nov 19 11:36:20 2020 -0500

    add suffix array benchmark sa.py

commit 65ad500
Author: Zhihui Du <zhihuidu@gmail.com>
Date:   Thu Nov 19 10:33:48 2020 -0500

    add the suffix array function to Arkouda
…e with master and pass tests:

- Updates logging statements to use logLevel instead of v (PR Bears-R-Us#760)
- Updates `check` to `check_table` (PR Bears-R-Us#792)
- Removes outdated `Strings.attach` function which used `_` for subcomponents (new convention is `.` per PR Bears-R-Us#774)
- Makes `unregister_strings_by_name` a staticmethod

Manually applied recent changes to `SegmentedArray.chpl` because automatic merge was thrown off by addition of `class SegSArray` including:
- PR Bears-R-Us#666 string test ends_with failure
- PR Bears-R-Us#781 off by one in SegmentedArray peel method
- Moves SArrarys to suffix_array.py
- Minimizes diff with master to highlight functionality changes
- Refomats suffix-array.py
- Reformats SACA.chpl to fix bracket alignment and to have more uniform formattting
- Removes unnecessary commented out code
- Add type hints to size and bytes in strings.py for mypy
- In SegmentedArray.chpl replaces writelns in SegSArray class with logging statements following the example of SegString class
- Renames `class SegSArray` to `class SegSuffixArray` and moved this class to SegmentedSuffixArray.chpl (previously in SegmentedArray.chpl along with SegString)
- Removes external `libdivsufsort` code and references. Changes default to @zhihuidu's native chapel suffix array construction implementation using skew algorithm
- Removes unused `fields` variable in _parse_single_int_array_value
- Passes Exception into ValueError in _parse_single_int_array_value
- Inverts logic of `in1d_int` to avoid duplicating the `ind1d` logic
- Adds `self.name` to `SArray` class and updates attach/register functions to use dots instead of dashs to be consistent with` the `Strings` class
- Removes unnecessary cast in `suffix_array_file`
- Changes typing from `sipHash128` and `computeSipHashLocalized` to accept int and uint(8)
- Updates suffix_array registration and info functions
- Renames SACA.chpl to SuffixArrayConstruction.chpl
- Adds UnitTestSuffixArrayConstruction.chpl
- Adds docstrings to SuffixArrayConstruction.chpl and removes duplicated logic by modifying `radixPass` to accept int or uint(8)
- Updates SegmentedSuffixArray to use ServerErrors per PR#882
@stress-tess
Copy link
Member Author

@zhihuidu, I rebased over master so the suffix_array branch is up to date with no merge conflicts. I also made some slight changes to SuffixArrayConstruction to shorten the code. I verified the functionality remained the same with python3 benchmarks/sa.py localhost 5555 --correctness-only

@zhihuidu
Copy link

@zhihuidu, I rebased over master so the suffix_array branch is up to date with no merge conflicts. I also made some slight changes to SuffixArrayConstruction to shorten the code. I verified the functionality remained the same with python3 benchmarks/sa.py localhost 5555 --correctness-only

@pierce314159 I appreciate it. In fact, you almost rewrite the code.

@mhmerrill
Copy link
Contributor

@pierce314159 @zhihuidu I would like to see this updated and moved to the arkouda-njit or arkouda-contrib so it can be included as an optional module in the arkouda build.

@zhihuidu
Copy link

zhihuidu commented Mar 3, 2022

@mhmerrill @pierce314159 No problem. I can rewrite the suffix array code based on Kyle's framework to integrate new functionalities and push it to arkouda-njit first.

@stress-tess
Copy link
Member Author

stress-tess commented Mar 3, 2022

@mhmerrill @pierce314159 No problem. I can rewrite the suffix array code based on Kyle's framework to integrate new functionalities and push it to arkouda-njit first.

@zhihuidu It might be easier than that, let me take a crack at it and see if I can use git magic to get it in line with the njit repo

@zhihuidu
Copy link

zhihuidu commented Mar 3, 2022

@pierce314159 If some tools can do it automatically, it will be better. Please go ahead.

@glitch
Copy link
Collaborator

glitch commented Mar 3, 2022

@mhmerrill @pierce314159 No problem. I can rewrite the suffix array code based on Kyle's framework to integrate new functionalities and push it to arkouda-njit first.

@zhihuidu It might be easier than that, let me take a crack at it and see if I can use git magic to get it in line with the njit repo

Note: It looked like @pierce314159 didn't have access to the repo, that should be fixed now. 👍

@mhmerrill
Copy link
Contributor

I am going to close this because we moved this to Bears-R-Us/arkouda-njit#3

@mhmerrill mhmerrill closed this Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

move PR #865 to the Bears-R-Us/arkouda-njit repo or Bears-R-Us/arkouda-contrib repo and update it to use new modular build infrastructure.

6 participants