Commit c6bbdf6
[SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
### What changes were proposed in this pull request?

This PR mainly proposes to run only relevant tests, just like the Jenkins PR builder does. Currently, GitHub Actions always runs the full tests, which wastes resources. In addition, this PR fixes three more closely related issues while here:

1. The main idea is to reuse the existing logic embedded in `dev/run-tests.py`, which the Jenkins PR builder uses in order to run only the related test cases.

2. SPARK-32292 is fixed too so that the doctests run. They failed because other refs were not available when the repository is cloned via `actions/checkout@v2`. With `fetch-depth: 0`, the history is available.

3. `dev/run-tests.py` is updated to match `python/run-tests.py` in terms of its options. Environment variables such as `TEST_ONLY_XXX` were turned into proper options. For example,

   ```bash
   dev/run-tests.py --modules sql,core
   ```

   which is consistent with `python/run-tests.py`, for example,

   ```bash
   python/run-tests.py --modules pyspark-core,pyspark-ml
   ```

4. Lastly, a formatting issue in the module specification in the matrix is fixed:

   ```diff
   - network_common, network_shuffle, repl, launcher
   + network-common, network-shuffle, repl, launcher,
   ```

   which previously made the build/test run against the wrong modules.

### Why are the changes needed?

By running only the related tests, we can hugely save resources and avoid unrelated flaky tests. Also, the doctests of `dev/run-tests.py` now run properly, the usage is consistent between `dev/run-tests.py` and `python/run-tests.py`, and the `network-common`, `network-shuffle`, `launcher` and `examples` modules are built and tested too.

### Does this PR introduce any user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my own forked Spark: #7 #8 #9 #10 #11 #12

Closes apache#29086 from HyukjinKwon/SPARK-32292.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
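The changed-file detection behind this relies on git history being present (hence `fetch-depth: 0`): pull-request builds diff against the target branch, while push builds diff against the previous commit. A minimal shell sketch of that branching; `diff_base` is a hypothetical helper written for illustration, the real logic lives in `dev/run-tests.py`:

```shell
# Hypothetical helper mirroring the branching in this PR: pull-request builds
# have GITHUB_BASE_REF set, push builds instead expose the previous commit
# via github.event.before (exported as GITHUB_PREV_SHA in the workflow).
diff_base() {
  if [ -n "$GITHUB_BASE_REF" ]; then
    echo "origin/$GITHUB_BASE_REF"   # pull request: diff against target branch
  else
    echo "$GITHUB_PREV_SHA"          # push: diff against the previous commit
  fi
}

# Example: list the files changed since the chosen base (needs full history,
# which is why the workflow sets fetch-depth: 0):
#   git diff --name-only "$(diff_base)" "$GITHUB_SHA"
```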
1 parent c872ede commit c6bbdf6

File tree: 2 files changed (+97, -29 lines)

.github/workflows/master.yml

Lines changed: 10 additions & 6 deletions
```diff
@@ -27,7 +27,7 @@ jobs:
         modules:
           - |-
             core, unsafe, kvstore, avro,
-            network_common, network_shuffle, repl, launcher
+            network-common, network-shuffle, repl, launcher,
             examples, sketch, graphx
           - |-
             catalyst, hive-thriftserver
@@ -69,15 +69,19 @@ jobs:
             excluded-tags: org.apache.spark.tags.ExtendedSQLTest
             comment: "- other tests"
     env:
-      TEST_ONLY_MODULES: ${{ matrix.modules }}
-      TEST_ONLY_EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
-      TEST_ONLY_INCLUDED_TAGS: ${{ matrix.included-tags }}
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
+      INCLUDED_TAGS: ${{ matrix.included-tags }}
       HADOOP_PROFILE: ${{ matrix.hadoop }}
       # GitHub Actions' default miniconda to use in pip packaging test.
       CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
     steps:
     - name: Checkout Spark repository
       uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
     # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
     - name: Cache Scala, SBT, Maven and Zinc
       uses: actions/cache@v1
@@ -154,9 +158,9 @@ jobs:
     - name: "Run tests: ${{ matrix.modules }}"
       run: |
         # Hive tests become flaky when running in parallel as it's too intensive.
-        if [[ "$TEST_ONLY_MODULES" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
+        if [[ "$MODULES_TO_TEST" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
         mkdir -p ~/.m2
-        ./dev/run-tests --parallelism 2
+        ./dev/run-tests --parallelism 2 --modules "$MODULES_TO_TEST" --included-tags "$INCLUDED_TAGS" --excluded-tags "$EXCLUDED_TAGS"
         rm -rf ~/.m2/repository/org/apache/spark

     # Static analysis, and documentation build
```
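The matrix fix above matters because the workflow now passes the module list straight into `--modules`, and entries are matched by name, so an entry like `network_common` silently matches nothing. A toy sketch of that matching behavior (the module registry here is illustrative, not Spark's real one in `dev/sparktestsupport/modules.py`):

```python
# Toy sketch of how a comma-separated --modules value is matched against known
# module names, mirroring the [m.strip() for m in value.split(",")] pattern in
# dev/run-tests.py. KNOWN_MODULES is a made-up subset for illustration.
KNOWN_MODULES = ["core", "network-common", "network-shuffle", "repl", "launcher"]

def select_modules(value):
    requested = [m.strip() for m in value.split(",")]
    # Unknown names (e.g. the old underscore spellings) and the empty entry
    # produced by a trailing comma simply match nothing; no error is raised.
    return [m for m in KNOWN_MODULES if m in requested]

# Old matrix entry: the underscore names match no known module, so those
# modules were silently skipped.
print(select_modules("core, network_common, network_shuffle, repl, launcher"))
# → ['core', 'repl', 'launcher']

# Fixed entry: hyphenated names match, and the trailing comma is harmless.
print(select_modules("network-common, network-shuffle, repl, launcher,"))
# → ['network-common', 'network-shuffle', 'repl', 'launcher']
```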

dev/run-tests.py

Lines changed: 87 additions & 23 deletions
```diff
@@ -100,12 +100,14 @@ def setup_test_environ(environ):
         os.environ[k] = v


-def determine_modules_to_test(changed_modules):
+def determine_modules_to_test(changed_modules, deduplicated=True):
     """
     Given a set of modules that have changed, compute the transitive closure of those modules'
     dependent modules in order to determine the set of modules that should be tested.

     Returns a topologically-sorted list of modules (ties are broken by sorting on module names).
+    If ``deduplicated`` is disabled, the modules are returned without taking the deduplication
+    by dependencies into account.

     >>> [x.name for x in determine_modules_to_test([modules.root])]
     ['root']
@@ -121,11 +123,29 @@ def determine_modules_to_test(changed_modules):
     ...  # doctest: +NORMALIZE_WHITESPACE
     ['sql', 'avro', 'hive', 'mllib', 'sql-kafka-0-10', 'examples', 'hive-thriftserver',
      'pyspark-sql', 'repl', 'sparkr', 'pyspark-mllib', 'pyspark-ml']
+    >>> sorted([x.name for x in determine_modules_to_test(
+    ...     [modules.sparkr, modules.sql], deduplicated=False)])
+    ...  # doctest: +NORMALIZE_WHITESPACE
+    ['avro', 'examples', 'hive', 'hive-thriftserver', 'mllib', 'pyspark-ml',
+     'pyspark-mllib', 'pyspark-sql', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
+    >>> sorted([x.name for x in determine_modules_to_test(
+    ...     [modules.sql, modules.core], deduplicated=False)])
+    ...  # doctest: +NORMALIZE_WHITESPACE
+    ['avro', 'catalyst', 'core', 'examples', 'graphx', 'hive', 'hive-thriftserver',
+     'mllib', 'mllib-local', 'pyspark-core', 'pyspark-ml', 'pyspark-mllib',
+     'pyspark-sql', 'pyspark-streaming', 'repl', 'root',
+     'sparkr', 'sql', 'sql-kafka-0-10', 'streaming', 'streaming-kafka-0-10',
+     'streaming-kinesis-asl']
     """
     modules_to_test = set()
     for module in changed_modules:
-        modules_to_test = modules_to_test.union(determine_modules_to_test(module.dependent_modules))
+        modules_to_test = modules_to_test.union(
+            determine_modules_to_test(module.dependent_modules, deduplicated))
     modules_to_test = modules_to_test.union(set(changed_modules))
+
+    if not deduplicated:
+        return modules_to_test
+
     # If we need to run all of the tests, then we should short-circuit and return 'root'
     if modules.root in modules_to_test:
         return [modules.root]
@@ -485,6 +505,24 @@ def parse_opts():
         "-p", "--parallelism", type="int", default=4,
         help="The number of suites to test in parallel (default %default)"
     )
+    parser.add_option(
+        "-m", "--modules", type="str",
+        default=None,
+        help="A comma-separated list of modules to test "
+             "(default: %s)" % ",".join(sorted([m.name for m in modules.all_modules]))
+    )
+    parser.add_option(
+        "-e", "--excluded-tags", type="str",
+        default=None,
+        help="A comma-separated list of tags to exclude in the tests, "
+             "e.g., org.apache.spark.tags.ExtendedHiveTest "
+    )
+    parser.add_option(
+        "-i", "--included-tags", type="str",
+        default=None,
+        help="A comma-separated list of tags to include in the tests, "
+             "e.g., org.apache.spark.tags.ExtendedHiveTest "
+    )

     (opts, args) = parser.parse_args()
     if args:
@@ -534,40 +572,72 @@ def main():
         # add path for Python3 in Jenkins if we're calling from a Jenkins machine
         os.environ["PATH"] = "/home/anaconda/envs/py3k/bin:" + os.environ.get("PATH")
     else:
-        # else we're running locally and can use local settings
+        # else we're running locally or Github Actions.
         build_tool = "sbt"
         hadoop_version = os.environ.get("HADOOP_PROFILE", "hadoop2.6")
-        test_env = "local"
+        if "GITHUB_ACTIONS" in os.environ:
+            test_env = "github_actions"
+        else:
+            test_env = "local"

     print("[info] Using build tool", build_tool, "with Hadoop profile", hadoop_version,
           "under environment", test_env)

     changed_modules = None
+    test_modules = None
     changed_files = None
-    should_only_test_modules = "TEST_ONLY_MODULES" in os.environ
+    should_only_test_modules = opts.modules is not None
     included_tags = []
+    excluded_tags = []
     if should_only_test_modules:
-        str_test_modules = [m.strip() for m in os.environ.get("TEST_ONLY_MODULES").split(",")]
+        str_test_modules = [m.strip() for m in opts.modules.split(",")]
         test_modules = [m for m in modules.all_modules if m.name in str_test_modules]
-        # Directly uses test_modules as changed modules to apply tags and environments
-        # as if all specified test modules are changed.
+
+        # If we're running the tests in Github Actions, attempt to detect and test
+        # only the affected modules.
+        if test_env == "github_actions":
+            if os.environ["GITHUB_BASE_REF"] != "":
+                # Pull requests
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_branch=os.environ["GITHUB_BASE_REF"])
+            else:
+                # Build for each commit.
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
+
+            modules_to_test = determine_modules_to_test(
+                determine_modules_for_files(changed_files), deduplicated=False)
+
+            if modules.root not in modules_to_test:
+                # If root module is not found, only test the intersected modules.
+                # If root module is found, just run the modules as specified initially.
+                test_modules = list(set(modules_to_test).intersection(test_modules))
+
         changed_modules = test_modules
-        str_excluded_tags = os.environ.get("TEST_ONLY_EXCLUDED_TAGS", None)
-        str_included_tags = os.environ.get("TEST_ONLY_INCLUDED_TAGS", None)
-        excluded_tags = []
-        if str_excluded_tags:
-            excluded_tags = [t.strip() for t in str_excluded_tags.split(",")]
-        included_tags = []
-        if str_included_tags:
-            included_tags = [t.strip() for t in str_included_tags.split(",")]
+        if len(changed_modules) == 0:
+            print("[info] There are no modules to test, exiting without testing.")
+            return
+
+    # If we're running the tests in AMPLab Jenkins, calculate the diff from the targeted branch, and
+    # detect modules to test.
     elif test_env == "amplab_jenkins" and os.environ.get("AMP_JENKINS_PRB"):
         target_branch = os.environ["ghprbTargetBranch"]
         changed_files = identify_changed_files_from_git_commits("HEAD", target_branch=target_branch)
         changed_modules = determine_modules_for_files(changed_files)
+        test_modules = determine_modules_to_test(changed_modules)
         excluded_tags = determine_tags_to_exclude(changed_modules)
+
+    # If there is no changed module found, tests all.
     if not changed_modules:
         changed_modules = [modules.root]
-        excluded_tags = []
+    if not test_modules:
+        test_modules = determine_modules_to_test(changed_modules)
+
+    if opts.excluded_tags:
+        excluded_tags.extend([t.strip() for t in opts.excluded_tags.split(",")])
+    if opts.included_tags:
+        included_tags.extend([t.strip() for t in opts.included_tags.split(",")])
+
     print("[info] Found the following changed modules:",
           ", ".join(x.name for x in changed_modules))

@@ -582,8 +652,6 @@ def main():

     should_run_java_style_checks = False
     if not should_only_test_modules:
-        test_modules = determine_modules_to_test(changed_modules)
-
         # license checks
         run_apache_rat_checks()

@@ -640,10 +708,6 @@ def main():


 def _test():
-    if "TEST_ONLY_MODULES" in os.environ:
-        # TODO(SPARK-32252): Enable doctests back in Github Actions.
-        return
-
     import doctest
     failure_count = doctest.testmod()[0]
     if failure_count:
```
