
Commit 27ef362

HyukjinKwon authored and dongjoon-hyun committed
[SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
### What changes were proposed in this pull request?

This PR mainly proposes to run only relevant tests, just like the Jenkins PR builder does. Currently, GitHub Actions always runs the full tests, which wastes resources. In addition, this PR fixes three more issues that are very closely related:

1. The main idea here is to reuse the existing logic embedded in `dev/run-tests.py`, which the Jenkins PR builder uses, in order to run only the related test cases.

2. While here, SPARK-32292 is fixed too so that the doctests run. They failed because other references were not available when the repository was cloned via `checkout@v2`. With `fetch-depth: 0`, the history is available.

3. In addition, `dev/run-tests.py` is fixed to match `python/run-tests.py` in terms of its options. Environment variables such as `TEST_ONLY_XXX` were turned into proper options. For example:

   ```bash
   dev/run-tests.py --modules sql,core
   ```

   which is consistent with `python/run-tests.py`, for example:

   ```bash
   python/run-tests.py --modules pyspark-core,pyspark-ml
   ```

4. Lastly, the formatting issue in the module specification in the matrix is fixed:

   ```diff
   - network_common, network_shuffle, repl, launcher
   + network-common, network-shuffle, repl, launcher,
   ```

   which caused the wrong modules to be built and tested.

### Why are the changes needed?

By running only the related tests, we can save resources considerably and avoid unrelated flaky tests, etc. Also, the doctests of `dev/run-tests.py` now run properly, the usage is consistent between `dev/run-tests.py` and `python/run-tests.py`, and the `network-common`, `network-shuffle`, `launcher` and `examples` modules are built and tested as well.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my own forked Spark: HyukjinKwon#7, HyukjinKwon#8, HyukjinKwon#9, HyukjinKwon#10, HyukjinKwon#11, HyukjinKwon#12

Closes #29086 from HyukjinKwon/SPARK-32292.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
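To make the module-selection idea concrete, here is a minimal, self-contained Python sketch of the flow (the `DEPENDENTS` graph and the path-to-module mapping are hypothetical, not Spark's actual module graph or the real `dev/run-tests.py` code): changed files map to modules, and the transitive closure of dependent modules is what gets tested.

```python
# Sketch of "run only relevant tests": map changed files to modules, then take
# the transitive closure of dependent modules. DEPENDENTS is a made-up graph.
DEPENDENTS = {
    "core": ["sql", "streaming"],
    "sql": ["hive"],
    "streaming": [],
    "hive": [],
}

def modules_for_files(changed_files):
    # Hypothetical mapping: the first path component names the module.
    return {f.split("/")[0] for f in changed_files}

def modules_to_test(changed_modules):
    to_test = set(changed_modules)
    for m in changed_modules:
        to_test |= modules_to_test(DEPENDENTS.get(m, []))
    return to_test

print(sorted(modules_to_test(modules_for_files(["sql/src/Foo.scala"]))))
# ['hive', 'sql'] -- 'core' and 'streaming' are skipped as unaffected
```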
1 parent 5521afb commit 27ef362

2 files changed: +96 -29 lines


.github/workflows/master.yml

Lines changed: 10 additions & 6 deletions
```diff
@@ -29,7 +29,7 @@ jobs:
         modules:
           - |-
             core, unsafe, kvstore, avro,
-            network_common, network_shuffle, repl, launcher
+            network-common, network-shuffle, repl, launcher,
             examples, sketch, graphx
           - |-
             catalyst, hive-thriftserver
```
```diff
@@ -75,16 +75,20 @@ jobs:
             excluded-tags: org.apache.spark.tags.ExtendedSQLTest
             comment: "- other tests"
     env:
-      TEST_ONLY_MODULES: ${{ matrix.modules }}
-      TEST_ONLY_EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
-      TEST_ONLY_INCLUDED_TAGS: ${{ matrix.included-tags }}
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
+      INCLUDED_TAGS: ${{ matrix.included-tags }}
       HADOOP_PROFILE: ${{ matrix.hadoop }}
       HIVE_PROFILE: ${{ matrix.hive }}
       # GitHub Actions' default miniconda to use in pip packaging test.
       CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
     steps:
     - name: Checkout Spark repository
       uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
     # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
     - name: Cache Scala, SBT, Maven and Zinc
       uses: actions/cache@v1
```
```diff
@@ -161,9 +165,9 @@ jobs:
     - name: "Run tests: ${{ matrix.modules }}"
       run: |
         # Hive tests become flaky when running in parallel as it's too intensive.
-        if [[ "$TEST_ONLY_MODULES" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
+        if [[ "$MODULES_TO_TEST" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
         mkdir -p ~/.m2
-        ./dev/run-tests --parallelism 2
+        ./dev/run-tests --parallelism 2 --modules "$MODULES_TO_TEST" --included-tags "$INCLUDED_TAGS" --excluded-tags "$EXCLUDED_TAGS"
         rm -rf ~/.m2/repository/org/apache/spark
 
   # Static analysis, and documentation build
```
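The `fetch-depth: 0` checkout above matters because the changed-file detection in `dev/run-tests.py` diffs the build's SHA against another ref, which a shallow clone does not have locally. As a hedged sketch (this is not the actual `identify_changed_files_from_git_commits` implementation), the underlying git query is roughly:

```python
# List files changed between a base ref and the build's SHA. The triple-dot
# form diffs from the merge-base, which requires the full history locally
# (hence fetch-depth: 0 in the checkout step above).
import subprocess

def changed_files(base_ref, head_sha):
    out = subprocess.check_output(
        ["git", "diff", "--name-only", "%s...%s" % (base_ref, head_sha)],
        universal_newlines=True)
    return [line for line in out.splitlines() if line]

# e.g., changed_files("origin/master", "HEAD") on a pull request build.
```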

dev/run-tests.py

Lines changed: 86 additions & 23 deletions
```diff
@@ -101,12 +101,14 @@ def setup_test_environ(environ):
         os.environ[k] = v
 
 
-def determine_modules_to_test(changed_modules):
+def determine_modules_to_test(changed_modules, deduplicated=True):
     """
     Given a set of modules that have changed, compute the transitive closure of those modules'
     dependent modules in order to determine the set of modules that should be tested.
 
     Returns a topologically-sorted list of modules (ties are broken by sorting on module names).
+    If ``deduplicated`` is disabled, the modules are returned without taking the deduplication
+    by dependencies into account.
 
     >>> [x.name for x in determine_modules_to_test([modules.root])]
     ['root']
@@ -122,11 +124,29 @@ def determine_modules_to_test(changed_modules):
     ... # doctest: +NORMALIZE_WHITESPACE
     ['sql', 'avro', 'hive', 'mllib', 'sql-kafka-0-10', 'examples', 'hive-thriftserver',
      'pyspark-sql', 'repl', 'sparkr', 'pyspark-mllib', 'pyspark-ml']
+    >>> sorted([x.name for x in determine_modules_to_test(
+    ...     [modules.sparkr, modules.sql], deduplicated=False)])
+    ... # doctest: +NORMALIZE_WHITESPACE
+    ['avro', 'examples', 'hive', 'hive-thriftserver', 'mllib', 'pyspark-ml',
+     'pyspark-mllib', 'pyspark-sql', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
+    >>> sorted([x.name for x in determine_modules_to_test(
+    ...     [modules.sql, modules.core], deduplicated=False)])
+    ... # doctest: +NORMALIZE_WHITESPACE
+    ['avro', 'catalyst', 'core', 'examples', 'graphx', 'hive', 'hive-thriftserver',
+     'mllib', 'mllib-local', 'pyspark-core', 'pyspark-ml', 'pyspark-mllib',
+     'pyspark-resource', 'pyspark-sql', 'pyspark-streaming', 'repl', 'root',
+     'sparkr', 'sql', 'sql-kafka-0-10', 'streaming', 'streaming-kafka-0-10',
+     'streaming-kinesis-asl']
     """
     modules_to_test = set()
     for module in changed_modules:
-        modules_to_test = modules_to_test.union(determine_modules_to_test(module.dependent_modules))
+        modules_to_test = modules_to_test.union(
+            determine_modules_to_test(module.dependent_modules, deduplicated))
     modules_to_test = modules_to_test.union(set(changed_modules))
+
+    if not deduplicated:
+        return modules_to_test
+
     # If we need to run all of the tests, then we should short-circuit and return 'root'
     if modules.root in modules_to_test:
         return [modules.root]
```
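A toy illustration of the new `deduplicated` flag (the `Module` class below is a stand-in, not Spark's): with `deduplicated=False`, the raw transitive closure is returned as a set, without the short-circuit to `root` or the topological deduplication, so a caller can intersect it with another module list.

```python
# Stand-in Module class; mirrors only the dependent_modules traversal.
class Module:
    def __init__(self, name, dependent_modules=()):
        self.name = name
        self.dependent_modules = list(dependent_modules)

def transitive_dependents(changed):
    # Equivalent in spirit to determine_modules_to_test(..., deduplicated=False).
    result = set(changed)
    for m in changed:
        result |= transitive_dependents(m.dependent_modules)
    return result

hive = Module("hive")
sql = Module("sql", dependent_modules=[hive])
core = Module("core", dependent_modules=[sql])

print(sorted(m.name for m in transitive_dependents([core])))
# ['core', 'hive', 'sql']
```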
```diff
@@ -539,6 +559,24 @@ def parse_opts():
         "-p", "--parallelism", type=int, default=8,
         help="The number of suites to test in parallel (default %(default)d)"
     )
+    parser.add_argument(
+        "-m", "--modules", type=str,
+        default=None,
+        help="A comma-separated list of modules to test "
+             "(default: %s)" % ",".join(sorted([m.name for m in modules.all_modules]))
+    )
+    parser.add_argument(
+        "-e", "--excluded-tags", type=str,
+        default=None,
+        help="A comma-separated list of tags to exclude in the tests, "
+             "e.g., org.apache.spark.tags.ExtendedHiveTest "
+    )
+    parser.add_argument(
+        "-i", "--included-tags", type=str,
+        default=None,
+        help="A comma-separated list of tags to include in the tests, "
+             "e.g., org.apache.spark.tags.ExtendedHiveTest "
+    )
 
     args, unknown = parser.parse_known_args()
     if unknown:
```
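The new options behave like standard `argparse` flags; a minimal standalone sketch (option names match the diff, help text shortened for illustration):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--modules", type=str, default=None,
                    help="A comma-separated list of modules to test")
parser.add_argument("-e", "--excluded-tags", type=str, default=None)
parser.add_argument("-i", "--included-tags", type=str, default=None)

args = parser.parse_args(
    ["--modules", "sql,core",
     "--excluded-tags", "org.apache.spark.tags.ExtendedHiveTest"])
print(args.modules.split(","))        # ['sql', 'core']
print(args.excluded_tags.split(","))  # ['org.apache.spark.tags.ExtendedHiveTest']
```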
```diff
@@ -589,43 +627,74 @@ def main():
         # /home/jenkins/anaconda2/envs/py36/bin
         os.environ["PATH"] = "/home/anaconda/envs/py36/bin:" + os.environ.get("PATH")
     else:
-        # else we're running locally and can use local settings
+        # else we're running locally or Github Actions.
        build_tool = "sbt"
         hadoop_version = os.environ.get("HADOOP_PROFILE", "hadoop2.7")
         hive_version = os.environ.get("HIVE_PROFILE", "hive2.3")
-        test_env = "local"
+        if "GITHUB_ACTIONS" in os.environ:
+            test_env = "github_actions"
+        else:
+            test_env = "local"
 
     print("[info] Using build tool", build_tool, "with Hadoop profile", hadoop_version,
           "and Hive profile", hive_version, "under environment", test_env)
     extra_profiles = get_hadoop_profiles(hadoop_version) + get_hive_profiles(hive_version)
 
     changed_modules = None
+    test_modules = None
     changed_files = None
-    should_only_test_modules = "TEST_ONLY_MODULES" in os.environ
+    should_only_test_modules = opts.modules is not None
     included_tags = []
+    excluded_tags = []
     if should_only_test_modules:
-        str_test_modules = [m.strip() for m in os.environ.get("TEST_ONLY_MODULES").split(",")]
+        str_test_modules = [m.strip() for m in opts.modules.split(",")]
         test_modules = [m for m in modules.all_modules if m.name in str_test_modules]
-        # Directly uses test_modules as changed modules to apply tags and environments
-        # as if all specified test modules are changed.
+
+        # If we're running the tests in Github Actions, attempt to detect and test
+        # only the affected modules.
+        if test_env == "github_actions":
+            if os.environ["GITHUB_BASE_REF"] != "":
+                # Pull requests
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_branch=os.environ["GITHUB_BASE_REF"])
+            else:
+                # Build for each commit.
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
+
+            modules_to_test = determine_modules_to_test(
+                determine_modules_for_files(changed_files), deduplicated=False)
+
+            if modules.root not in modules_to_test:
+                # If root module is not found, only test the intersected modules.
+                # If root module is found, just run the modules as specified initially.
+                test_modules = list(set(modules_to_test).intersection(test_modules))
+
         changed_modules = test_modules
-        str_excluded_tags = os.environ.get("TEST_ONLY_EXCLUDED_TAGS", None)
-        str_included_tags = os.environ.get("TEST_ONLY_INCLUDED_TAGS", None)
-        excluded_tags = []
-        if str_excluded_tags:
-            excluded_tags = [t.strip() for t in str_excluded_tags.split(",")]
-        included_tags = []
-        if str_included_tags:
-            included_tags = [t.strip() for t in str_included_tags.split(",")]
+        if len(changed_modules) == 0:
+            print("[info] There are no modules to test, exiting without testing.")
+            return
+
+    # If we're running the tests in AMPLab Jenkins, calculate the diff from the targeted branch, and
+    # detect modules to test.
     elif test_env == "amplab_jenkins" and os.environ.get("AMP_JENKINS_PRB"):
         target_branch = os.environ["ghprbTargetBranch"]
         changed_files = identify_changed_files_from_git_commits("HEAD", target_branch=target_branch)
         changed_modules = determine_modules_for_files(changed_files)
+        test_modules = determine_modules_to_test(changed_modules)
         excluded_tags = determine_tags_to_exclude(changed_modules)
 
+    # If there is no changed module found, tests all.
     if not changed_modules:
         changed_modules = [modules.root]
-        excluded_tags = []
+    if not test_modules:
+        test_modules = determine_modules_to_test(changed_modules)
+
+    if opts.excluded_tags:
+        excluded_tags.extend([t.strip() for t in opts.excluded_tags.split(",")])
+    if opts.included_tags:
+        included_tags.extend([t.strip() for t in opts.included_tags.split(",")])
+
     print("[info] Found the following changed modules:",
           ", ".join(x.name for x in changed_modules))
```

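Two details in the block above are easy to miss. `GITHUB_BASE_REF` is set by GitHub Actions only for pull request events, so an empty value means a push build, which diffs against `GITHUB_PREV_SHA` (`github.event.before` from the workflow) instead. And the intersection step caps the matrix's `MODULES_TO_TEST` by what the diff actually touched; with hypothetical module names:

```python
# The matrix entry requests a fixed module set; only modules also affected by
# the diff survive the intersection (module names here are illustrative).
requested = {"core", "unsafe", "kvstore", "avro"}  # from --modules
affected = {"core", "sql", "hive"}                 # from the git diff
print(sorted(requested & affected))
# ['core']
```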
```diff
@@ -640,8 +709,6 @@ def main():
 
     should_run_java_style_checks = False
     if not should_only_test_modules:
-        test_modules = determine_modules_to_test(changed_modules)
-
         # license checks
         run_apache_rat_checks()
```

```diff
@@ -702,10 +769,6 @@ def main():
 
 
 def _test():
-    if "TEST_ONLY_MODULES" in os.environ:
-        # TODO(SPARK-32252): Enable doctests back in Github Actions.
-        return
-
     import doctest
     failure_count = doctest.testmod()[0]
     if failure_count:
```
