
Commit 4b8e34e

[SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
This PR mainly proposes to run only the relevant tests, just as the Jenkins PR builder does. Currently, GitHub Actions always runs the full test suite, which wastes resources. In addition, this PR fixes three more closely related issues while I am here.

1. The main idea: reuse the existing logic embedded in `dev/run-tests.py`, which the Jenkins PR builder uses to run only the related test cases.

2. While I am here, I also fixed SPARK-32292 to run the doctests. They failed because other git references were not available when the repository is cloned via `actions/checkout@v2`; with `fetch-depth: 0`, the full history is available.

3. In addition, it changes `dev/run-tests.py` to match `python/run-tests.py` in terms of its options. Environment variables such as `TEST_ONLY_XXX` were turned into proper options. For example,

   ```bash
   dev/run-tests.py --modules sql,core
   ```

   which is consistent with `python/run-tests.py`, for example,

   ```bash
   python/run-tests.py --modules pyspark-core,pyspark-ml
   ```

4. Lastly, it fixes a formatting issue in the module specification in the matrix:

   ```diff
   - network_common, network_shuffle, repl, launcher
   + network-common, network-shuffle, repl, launcher,
   ```

   which caused those modules to be built and tested incorrectly.

By running only the related tests, we save resources considerably and avoid unrelated flaky tests, etc. Also, the doctests of `dev/run-tests.py` now run properly, the usage is consistent between `dev/run-tests.py` and `python/run-tests.py`, and the `network-common`, `network-shuffle`, `launcher` and `examples` modules are now built and tested as well.

No user-facing change; this is dev-only. Manually tested in my own forked Spark: #7 #8 #9 #10 #11 #12

Closes apache#29086 from HyukjinKwon/SPARK-32292.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
1 parent 8ce7c0c commit 4b8e34e
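The core of item 1 in the description is dependency-aware test selection: starting from the modules that changed, also test every module that transitively depends on them. Below is a minimal, self-contained sketch of that idea using a made-up module graph; it is not Spark's actual `determine_modules_to_test`, which additionally short-circuits to `root` and topologically sorts the result.

```python
# Toy illustration of dependency-aware test selection. The dependent-module graph
# below is invented for this example; Spark's real graph lives in dev/sparktestsupport.
TOY_DEPENDENTS = {
    "core": ["sql", "streaming"],   # modules that depend on "core"
    "sql": ["hive", "examples"],
    "streaming": [],
    "hive": [],
    "examples": [],
}

def modules_to_test(changed):
    """Return the changed modules plus every transitive dependent."""
    result = set(changed)
    stack = list(changed)
    while stack:
        module = stack.pop()
        for dependent in TOY_DEPENDENTS.get(module, []):
            if dependent not in result:
                result.add(dependent)
                stack.append(dependent)
    return sorted(result)

print(modules_to_test(["sql"]))   # ['examples', 'hive', 'sql']
print(modules_to_test(["core"]))  # ['core', 'examples', 'hive', 'sql', 'streaming']
```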

File tree: 2 files changed (+96, -29 lines)


.github/workflows/master.yml

Lines changed: 10 additions & 6 deletions
```diff
@@ -29,7 +29,7 @@ jobs:
         modules:
           - |-
             core, unsafe, kvstore, avro,
-            network_common, network_shuffle, repl, launcher
+            network-common, network-shuffle, repl, launcher,
             examples, sketch, graphx
           - |-
             catalyst, hive-thriftserver
@@ -75,16 +75,20 @@ jobs:
             excluded-tags: org.apache.spark.tags.ExtendedSQLTest
             comment: "- other tests"
     env:
-      TEST_ONLY_MODULES: ${{ matrix.modules }}
-      TEST_ONLY_EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
-      TEST_ONLY_INCLUDED_TAGS: ${{ matrix.included-tags }}
+      MODULES_TO_TEST: ${{ matrix.modules }}
+      EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
+      INCLUDED_TAGS: ${{ matrix.included-tags }}
       HADOOP_PROFILE: ${{ matrix.hadoop }}
       HIVE_PROFILE: ${{ matrix.hive }}
       # GitHub Actions' default miniconda to use in pip packaging test.
       CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
     steps:
     - name: Checkout Spark repository
       uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
     # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
     - name: Cache Scala, SBT, Maven and Zinc
       uses: actions/cache@v1
@@ -161,9 +165,9 @@ jobs:
     - name: "Run tests: ${{ matrix.modules }}"
       run: |
         # Hive tests become flaky when running in parallel as it's too intensive.
-        if [[ "$TEST_ONLY_MODULES" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
+        if [[ "$MODULES_TO_TEST" == "hive" ]]; then export SERIAL_SBT_TESTS=1; fi
         mkdir -p ~/.m2
-        ./dev/run-tests --parallelism 2
+        ./dev/run-tests --parallelism 2 --modules "$MODULES_TO_TEST" --included-tags "$INCLUDED_TAGS" --excluded-tags "$EXCLUDED_TAGS"
         rm -rf ~/.m2/repository/org/apache/spark

   # Static analysis, and documentation build
```
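A note on the one-character matrix fix above (item 4 in the description): the `|-` block scalar keeps its newlines, and `dev/run-tests.py` later splits the resulting string on commas. One reading of the fix is that, without the trailing comma, `launcher` and `examples` end up glued into one bogus entry, and the underscored names match no real module name, so those modules were effectively not built or tested. A quick illustration of that reading (not Spark's code):

```python
# Illustrative only: mimic how dev/run-tests.py splits the matrix string on commas.
before = ("core, unsafe, kvstore, avro,\n"
          "network_common, network_shuffle, repl, launcher\n"
          "examples, sketch, graphx")
after = ("core, unsafe, kvstore, avro,\n"
         "network-common, network-shuffle, repl, launcher,\n"
         "examples, sketch, graphx")

def parse(modules):
    return [m.strip() for m in modules.split(",")]

print(parse(before))  # ..., 'network_common', ..., 'launcher\nexamples', ... (no such modules)
print(parse(after))   # ..., 'network-common', ..., 'launcher', 'examples', ...
```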

dev/run-tests.py

Lines changed: 86 additions & 23 deletions
```diff
@@ -101,12 +101,14 @@ def setup_test_environ(environ):
         os.environ[k] = v
 
 
-def determine_modules_to_test(changed_modules):
+def determine_modules_to_test(changed_modules, deduplicated=True):
     """
     Given a set of modules that have changed, compute the transitive closure of those modules'
     dependent modules in order to determine the set of modules that should be tested.
 
     Returns a topologically-sorted list of modules (ties are broken by sorting on module names).
+    If ``deduplicated`` is disabled, the modules are returned without tacking the deduplication
+    by dependencies into account.
 
     >>> [x.name for x in determine_modules_to_test([modules.root])]
     ['root']
@@ -122,11 +124,29 @@ def determine_modules_to_test(changed_modules):
     ... # doctest: +NORMALIZE_WHITESPACE
     ['sql', 'avro', 'hive', 'mllib', 'sql-kafka-0-10', 'examples', 'hive-thriftserver',
      'pyspark-sql', 'repl', 'sparkr', 'pyspark-mllib', 'pyspark-ml']
+    >>> sorted([x.name for x in determine_modules_to_test(
+    ...     [modules.sparkr, modules.sql], deduplicated=False)])
+    ... # doctest: +NORMALIZE_WHITESPACE
+    ['avro', 'examples', 'hive', 'hive-thriftserver', 'mllib', 'pyspark-ml',
+     'pyspark-mllib', 'pyspark-sql', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
+    >>> sorted([x.name for x in determine_modules_to_test(
+    ...     [modules.sql, modules.core], deduplicated=False)])
+    ... # doctest: +NORMALIZE_WHITESPACE
+    ['avro', 'catalyst', 'core', 'examples', 'graphx', 'hive', 'hive-thriftserver',
+     'mllib', 'mllib-local', 'pyspark-core', 'pyspark-ml', 'pyspark-mllib',
+     'pyspark-sql', 'pyspark-streaming', 'repl', 'root',
+     'sparkr', 'sql', 'sql-kafka-0-10', 'streaming', 'streaming-kafka-0-10',
+     'streaming-kinesis-asl']
     """
     modules_to_test = set()
     for module in changed_modules:
-        modules_to_test = modules_to_test.union(determine_modules_to_test(module.dependent_modules))
+        modules_to_test = modules_to_test.union(
+            determine_modules_to_test(module.dependent_modules, deduplicated))
     modules_to_test = modules_to_test.union(set(changed_modules))
+
+    if not deduplicated:
+        return modules_to_test
+
     # If we need to run all of the tests, then we should short-circuit and return 'root'
     if modules.root in modules_to_test:
         return [modules.root]
@@ -538,6 +558,24 @@ def parse_opts():
         "-p", "--parallelism", type=int, default=8,
         help="The number of suites to test in parallel (default %(default)d)"
     )
+    parser.add_argument(
+        "-m", "--modules", type=str,
+        default=None,
+        help="A comma-separated list of modules to test "
+             "(default: %s)" % ",".join(sorted([m.name for m in modules.all_modules]))
+    )
+    parser.add_argument(
+        "-e", "--excluded-tags", type=str,
+        default=None,
+        help="A comma-separated list of tags to exclude in the tests, "
+             "e.g., org.apache.spark.tags.ExtendedHiveTest "
+    )
+    parser.add_argument(
+        "-i", "--included-tags", type=str,
+        default=None,
+        help="A comma-separated list of tags to include in the tests, "
+             "e.g., org.apache.spark.tags.ExtendedHiveTest "
+    )
 
     args, unknown = parser.parse_known_args()
     if unknown:
@@ -588,43 +626,74 @@ def main():
         # /home/jenkins/anaconda2/envs/py36/bin
         os.environ["PATH"] = "/home/anaconda/envs/py36/bin:" + os.environ.get("PATH")
     else:
-        # else we're running locally and can use local settings
+        # else we're running locally or Github Actions.
        build_tool = "sbt"
         hadoop_version = os.environ.get("HADOOP_PROFILE", "hadoop2.7")
         hive_version = os.environ.get("HIVE_PROFILE", "hive2.3")
-        test_env = "local"
+        if "GITHUB_ACTIONS" in os.environ:
+            test_env = "github_actions"
+        else:
+            test_env = "local"
 
     print("[info] Using build tool", build_tool, "with Hadoop profile", hadoop_version,
           "and Hive profile", hive_version, "under environment", test_env)
     extra_profiles = get_hadoop_profiles(hadoop_version) + get_hive_profiles(hive_version)
 
     changed_modules = None
+    test_modules = None
     changed_files = None
-    should_only_test_modules = "TEST_ONLY_MODULES" in os.environ
+    should_only_test_modules = opts.modules is not None
     included_tags = []
+    excluded_tags = []
     if should_only_test_modules:
-        str_test_modules = [m.strip() for m in os.environ.get("TEST_ONLY_MODULES").split(",")]
+        str_test_modules = [m.strip() for m in opts.modules.split(",")]
         test_modules = [m for m in modules.all_modules if m.name in str_test_modules]
-        # Directly uses test_modules as changed modules to apply tags and environments
-        # as if all specified test modules are changed.
+
+        # If we're running the tests in Github Actions, attempt to detect and test
+        # only the affected modules.
+        if test_env == "github_actions":
+            if os.environ["GITHUB_BASE_REF"] != "":
+                # Pull requests
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_branch=os.environ["GITHUB_BASE_REF"])
+            else:
+                # Build for each commit.
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
+
+            modules_to_test = determine_modules_to_test(
+                determine_modules_for_files(changed_files), deduplicated=False)
+
+            if modules.root not in modules_to_test:
+                # If root module is not found, only test the intersected modules.
+                # If root module is found, just run the modules as specified initially.
+                test_modules = list(set(modules_to_test).intersection(test_modules))
+
         changed_modules = test_modules
-        str_excluded_tags = os.environ.get("TEST_ONLY_EXCLUDED_TAGS", None)
-        str_included_tags = os.environ.get("TEST_ONLY_INCLUDED_TAGS", None)
-        excluded_tags = []
-        if str_excluded_tags:
-            excluded_tags = [t.strip() for t in str_excluded_tags.split(",")]
-        included_tags = []
-        if str_included_tags:
-            included_tags = [t.strip() for t in str_included_tags.split(",")]
+        if len(changed_modules) == 0:
+            print("[info] There are no modules to test, exiting without testing.")
+            return
+
+    # If we're running the tests in AMPLab Jenkins, calculate the diff from the targeted branch, and
+    # detect modules to test.
     elif test_env == "amplab_jenkins" and os.environ.get("AMP_JENKINS_PRB"):
         target_branch = os.environ["ghprbTargetBranch"]
         changed_files = identify_changed_files_from_git_commits("HEAD", target_branch=target_branch)
         changed_modules = determine_modules_for_files(changed_files)
+        test_modules = determine_modules_to_test(changed_modules)
         excluded_tags = determine_tags_to_exclude(changed_modules)
 
+    # If there is no changed module found, tests all.
     if not changed_modules:
         changed_modules = [modules.root]
-        excluded_tags = []
+    if not test_modules:
+        test_modules = determine_modules_to_test(changed_modules)
+
+    if opts.excluded_tags:
+        excluded_tags.extend([t.strip() for t in opts.excluded_tags.split(",")])
+    if opts.included_tags:
+        included_tags.extend([t.strip() for t in opts.included_tags.split(",")])
+
     print("[info] Found the following changed modules:",
           ", ".join(x.name for x in changed_modules))
 
@@ -639,8 +708,6 @@ def main():
 
     should_run_java_style_checks = False
     if not should_only_test_modules:
-        test_modules = determine_modules_to_test(changed_modules)
-
         # license checks
         run_apache_rat_checks()
 
@@ -701,10 +768,6 @@ def main():
 
 
 def _test():
-    if "TEST_ONLY_MODULES" in os.environ:
-        # TODO(SPARK-32252): Enable doctests back in Github Actions.
-        return
-
     import doctest
     failure_count = doctest.testmod()[0]
     if failure_count:
```
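For item 2 in the description (`fetch-depth: 0`), the GitHub Actions branch of `main()` needs enough git history to diff the pushed commit against either the pull request's target branch (`GITHUB_BASE_REF`) or the previous head of the branch (`github.event.before`, exposed to the script as `GITHUB_PREV_SHA`). The sketch below shows that selection in isolation; it is a simplified stand-in for `identify_changed_files_from_git_commits`, not the exact git invocation the script uses.

```python
import subprocess

def changed_files(github_sha, github_base_ref="", github_prev_sha=""):
    """Simplified sketch: pick the diff target the way the GitHub Actions branch does."""
    if github_base_ref:
        # Pull request build: compare against the target branch.
        diff_target = "origin/" + github_base_ref
    else:
        # Push build: compare against the previous head of the branch.
        diff_target = github_prev_sha
    # Requires the full history from `fetch-depth: 0`; a shallow clone would not
    # contain the target branch or the previous commit to diff against.
    raw = subprocess.check_output(["git", "diff", "--name-only", github_sha, diff_target])
    return [f for f in raw.decode("utf-8").split("\n") if f]
```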
