Skip to content

Commit

Permalink
Merge tag 'pm-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/…
Browse files Browse the repository at this point in the history
…git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These update cpufreq (core and drivers), cpuidle (polling state
  implementation and the PSCI driver), the OPP (operating performance
  points) framework, devfreq (core and drivers), the power capping RAPL
  (Running Average Power Limit) driver, the Energy Model support, the
  generic power domains (genpd) framework, the ACPI device power
  management, the core system-wide suspend code and power management
  utilities.

  Specifics:

   - Use local_clock() instead of jiffies in the cpufreq statistics to
     improve accuracy (Viresh Kumar).

   - Fix up OPP usage in the cpufreq-dt and qcom-cpufreq-nvmem cpufreq
     drivers (Viresh Kumar).

   - Clean up the cpufreq core, the intel_pstate driver and the
     schedutil cpufreq governor (Rafael Wysocki).

   - Fix up error code paths in the sti-cpufreq and mediatek cpufreq
     drivers (Yangtao Li, Qinglang Miao).

   - Fix cpufreq_online() to return error codes instead of success (0)
     in all cases when it fails (Wang ShaoBo).

   - Add mt8167 support to the mediatek cpufreq driver and blacklist
     mt8516 in the cpufreq-dt-platdev driver (Fabien Parent).

   - Modify the tegra194 cpufreq driver to always return values from the
     frequency table as the current frequency and clean up that driver
     (Sumit Gupta, Jon Hunter).

   - Modify the arm_scmi cpufreq driver to allow it to discover the
     power scale present in the performance protocol and provide this
     information to the Energy Model (Lukasz Luba).

   - Add missing MODULE_DEVICE_TABLE to several cpufreq drivers (Pali
     Rohár).

   - Clean up the CPPC cpufreq driver (Ionela Voinescu).

   - Fix NVMEM_IMX_OCOTP dependency in the imx cpufreq driver (Arnd
     Bergmann).

   - Rework the poling interval selection for the polling state in
     cpuidle (Mel Gorman).

   - Enable suspend-to-idle for PSCI OSI mode in the PSCI cpuidle driver
     (Ulf Hansson).

   - Modify the OPP framework to support empty (node-less) OPP tables in
     DT for passing dependency information (Nicola Mazzucato).

   - Fix potential lockdep issue in the OPP core and clean up the OPP
     core (Viresh Kumar).

   - Modify dev_pm_opp_put_regulators() to accept a NULL argument and
     update its users accordingly (Viresh Kumar).

   - Add frequency changes tracepoint to devfreq (Matthias Kaehlcke).

   - Add support for governor feature flags to devfreq, make devfreq
     sysfs file permissions depend on the governor and clean up the
     devfreq core (Chanwoo Choi).

   - Clean up the tegra20 devfreq driver and deprecate it to allow
     another driver based on EMC_STAT to be used instead of it (Dmitry
     Osipenko).

   - Add interconnect support to the tegra30 devfreq driver, allow it to
     take the interconnect and OPP information from DT and clean it up
     (Dmitry Osipenko).

   - Add interconnect support to the exynos-bus devfreq driver along
     with interconnect properties documentation (Sylwester Nawrocki).

   - Add suport for AMD Fam17h and Fam19h processors to the RAPL power
     capping driver (Victor Ding, Kim Phillips).

   - Fix handling of overly long constraint names in the powercap
     framework (Lukasz Luba).

   - Fix the wakeup configuration handling for bridges in the ACPI
     device power management core (Rafael Wysocki).

   - Add support for using an abstract scale for power units in the
     Energy Model (EM) and document it (Lukasz Luba).

   - Add em_cpu_energy() micro-optimization to the EM (Pavankumar
     Kondeti).

   - Modify the generic power domains (genpd) framwework to support
     suspend-to-idle (Ulf Hansson).

   - Fix creation of debugfs nodes in genpd (Thierry Strudel).

   - Clean up genpd (Lina Iyer).

   - Clean up the core system-wide suspend code and make it print driver
     flags for devices with debug enabled (Alex Shi, Patrice Chotard,
     Chen Yu).

   - Modify the ACPI system reboot code to make it prepare for system
     power off to avoid confusing the platform firmware (Kai-Heng Feng).

   - Update the pm-graph (multiple changes, mostly usability-related)
     and cpupower (online and offline CPU information support) PM
     utilities (Todd Brandt, Brahadambal Srinivasan)"

* tag 'pm-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (86 commits)
  cpufreq: Fix cpufreq_online() return value on errors
  cpufreq: Fix up several kerneldoc comments
  cpufreq: stats: Use local_clock() instead of jiffies
  cpufreq: schedutil: Simplify sugov_update_next_freq()
  cpufreq: intel_pstate: Simplify intel_cpufreq_update_pstate()
  PM: domains: create debugfs nodes when adding power domains
  opp: of: Allow empty opp-table with opp-shared
  dt-bindings: opp: Allow empty OPP tables
  media: venus: dev_pm_opp_put_*() accepts NULL argument
  drm/panfrost: dev_pm_opp_put_*() accepts NULL argument
  drm/lima: dev_pm_opp_put_*() accepts NULL argument
  PM / devfreq: exynos: dev_pm_opp_put_*() accepts NULL argument
  cpufreq: qcom-cpufreq-nvmem: dev_pm_opp_put_*() accepts NULL argument
  cpufreq: dt: dev_pm_opp_put_regulators() accepts NULL argument
  opp: Allow dev_pm_opp_put_*() APIs to accept NULL opp_table
  opp: Don't create an OPP table from dev_pm_opp_get_opp_table()
  cpufreq: dt: Don't (ab)use dev_pm_opp_get_opp_table() to create OPP table
  opp: Reduce the size of critical section in _opp_kref_release()
  PM / EM: Micro optimization in em_cpu_energy
  cpufreq: arm_scmi: Discover the power scale in performance protocol
  ...
  • Loading branch information
torvalds committed Dec 16, 2020
2 parents b109bc7 + b3fac81 commit b4ec805
Show file tree
Hide file tree
Showing 80 changed files with 1,692 additions and 1,210 deletions.
54 changes: 32 additions & 22 deletions Documentation/ABI/testing/sysfs-class-devfreq
Original file line number Diff line number Diff line change
Expand Up @@ -37,20 +37,6 @@ Description:
The /sys/class/devfreq/.../target_freq shows the next governor
predicted target frequency of the corresponding devfreq object.

What: /sys/class/devfreq/.../polling_interval
Date: September 2011
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
Description:
The /sys/class/devfreq/.../polling_interval shows and sets
the requested polling interval of the corresponding devfreq
object. The values are represented in ms. If the value is
less than 1 jiffy, it is considered to be 0, which means
no polling. This value is meaningless if the governor is
not polling; thus. If the governor is not using
devfreq-provided central polling
(/sys/class/devfreq/.../central_polling is 0), this value
may be useless.

What: /sys/class/devfreq/.../trans_stat
Date: October 2012
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
Expand All @@ -66,14 +52,6 @@ Description:

echo 0 > /sys/class/devfreq/.../trans_stat

What: /sys/class/devfreq/.../userspace/set_freq
Date: September 2011
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
Description:
The /sys/class/devfreq/.../userspace/set_freq shows and
sets the requested frequency for the devfreq object if
userspace governor is in effect.

What: /sys/class/devfreq/.../available_frequencies
Date: October 2012
Contact: Nishanth Menon <nm@ti.com>
Expand Down Expand Up @@ -110,6 +88,35 @@ Description:
The max_freq overrides min_freq because max_freq may be
used to throttle devices to avoid overheating.

What: /sys/class/devfreq/.../polling_interval
Date: September 2011
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
Description:
The /sys/class/devfreq/.../polling_interval shows and sets
the requested polling interval of the corresponding devfreq
object. The values are represented in ms. If the value is
less than 1 jiffy, it is considered to be 0, which means
no polling. This value is meaningless if the governor is
not polling; thus. If the governor is not using
devfreq-provided central polling
(/sys/class/devfreq/.../central_polling is 0), this value
may be useless.

A list of governors that support the node:
- simple_ondmenad
- tegra_actmon

What: /sys/class/devfreq/.../userspace/set_freq
Date: September 2011
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
Description:
The /sys/class/devfreq/.../userspace/set_freq shows and
sets the requested frequency for the devfreq object if
userspace governor is in effect.

A list of governors that support the node:
- userspace

What: /sys/class/devfreq/.../timer
Date: July 2020
Contact: Chanwoo Choi <cw00.choi@samsung.com>
Expand All @@ -122,3 +129,6 @@ Description:

echo deferrable > /sys/class/devfreq/.../timer
echo delayed > /sys/class/devfreq/.../timer

A list of governors that support the node:
- simple_ondemand
71 changes: 69 additions & 2 deletions Documentation/devicetree/bindings/devfreq/exynos-bus.txt
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,19 @@ Optional properties only for parent bus device:
- exynos,saturation-ratio: the percentage value which is used to calibrate
the performance count against total cycle count.

Optional properties for the interconnect functionality (QoS frequency
constraints):
- #interconnect-cells: should be 0.
- interconnects: as documented in ../interconnect.txt, describes a path at the
higher level interconnects used by this interconnect provider.
If this interconnect provider is directly linked to a top level interconnect
provider the property contains only one phandle. The provider extends
the interconnect graph by linking its node to a node registered by provider
pointed to by first phandle in the 'interconnects' property.

- samsung,data-clock-ratio: ratio of the data throughput in B/s to minimum data
clock frequency in Hz, default value is 8 when this property is missing.

Detailed correlation between sub-blocks and power line according to Exynos SoC:
- In case of Exynos3250, there are two power line as following:
VDD_MIF |--- DMC
Expand Down Expand Up @@ -135,7 +148,7 @@ Detailed correlation between sub-blocks and power line according to Exynos SoC:
|--- PERIC (Fixed clock rate)
|--- FSYS (Fixed clock rate)

Example1:
Example 1:
Show the AXI buses of Exynos3250 SoC. Exynos3250 divides the buses to
power line (regulator). The MIF (Memory Interface) AXI bus is used to
transfer data between DRAM and CPU and uses the VDD_MIF regulator.
Expand Down Expand Up @@ -184,7 +197,7 @@ Example1:
|L5 |200000 |200000 |400000 |300000 | ||1000000 |
----------------------------------------------------------

Example2 :
Example 2:
The bus of DMC (Dynamic Memory Controller) block in exynos3250.dtsi
is listed below:

Expand Down Expand Up @@ -419,3 +432,57 @@ Example2 :
devfreq = <&bus_leftbus>;
status = "okay";
};

Example 3:
An interconnect path "bus_display -- bus_leftbus -- bus_dmc" on
Exynos4412 SoC with video mixer as an interconnect consumer device.

soc {
bus_dmc: bus_dmc {
compatible = "samsung,exynos-bus";
clocks = <&clock CLK_DIV_DMC>;
clock-names = "bus";
operating-points-v2 = <&bus_dmc_opp_table>;
samsung,data-clock-ratio = <4>;
#interconnect-cells = <0>;
};

bus_leftbus: bus_leftbus {
compatible = "samsung,exynos-bus";
clocks = <&clock CLK_DIV_GDL>;
clock-names = "bus";
operating-points-v2 = <&bus_leftbus_opp_table>;
#interconnect-cells = <0>;
interconnects = <&bus_dmc>;
};

bus_display: bus_display {
compatible = "samsung,exynos-bus";
clocks = <&clock CLK_ACLK160>;
clock-names = "bus";
operating-points-v2 = <&bus_display_opp_table>;
#interconnect-cells = <0>;
interconnects = <&bus_leftbus &bus_dmc>;
};

bus_dmc_opp_table: opp_table1 {
compatible = "operating-points-v2";
/* ... */
}

bus_leftbus_opp_table: opp_table3 {
compatible = "operating-points-v2";
/* ... */
};

bus_display_opp_table: opp_table4 {
compatible = "operating-points-v2";
/* .. */
};

&mixer {
compatible = "samsung,exynos4212-mixer";
interconnects = <&bus_display &bus_dmc>;
/* ... */
};
};
54 changes: 53 additions & 1 deletion Documentation/devicetree/bindings/opp/opp.txt
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,9 @@ Required properties:

- OPP nodes: One or more OPP nodes describing voltage-current-frequency
combinations. Their name isn't significant but their phandle can be used to
reference an OPP.
reference an OPP. These are mandatory except for the case where the OPP table
is present only to indicate dependency between devices using the opp-shared
property.

Optional properties:
- opp-shared: Indicates that device nodes using this OPP Table Node's phandle
Expand Down Expand Up @@ -568,3 +570,53 @@ Example 6: opp-microvolt-<name>, opp-microamp-<name>:
};
};
};

Example 7: Single cluster Quad-core ARM cortex A53, OPP points from firmware,
distinct clock controls but two sets of clock/voltage/current lines.

/ {
cpus {
#address-cells = <2>;
#size-cells = <0>;

cpu@0 {
compatible = "arm,cortex-a53";
reg = <0x0 0x100>;
next-level-cache = <&A53_L2>;
clocks = <&dvfs_controller 0>;
operating-points-v2 = <&cpu_opp0_table>;
};
cpu@1 {
compatible = "arm,cortex-a53";
reg = <0x0 0x101>;
next-level-cache = <&A53_L2>;
clocks = <&dvfs_controller 1>;
operating-points-v2 = <&cpu_opp0_table>;
};
cpu@2 {
compatible = "arm,cortex-a53";
reg = <0x0 0x102>;
next-level-cache = <&A53_L2>;
clocks = <&dvfs_controller 2>;
operating-points-v2 = <&cpu_opp1_table>;
};
cpu@3 {
compatible = "arm,cortex-a53";
reg = <0x0 0x103>;
next-level-cache = <&A53_L2>;
clocks = <&dvfs_controller 3>;
operating-points-v2 = <&cpu_opp1_table>;
};

};

cpu_opp0_table: opp0_table {
compatible = "operating-points-v2";
opp-shared;
};

cpu_opp1_table: opp1_table {
compatible = "operating-points-v2";
opp-shared;
};
};
12 changes: 11 additions & 1 deletion Documentation/driver-api/thermal/power_allocator.rst
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,9 @@ to the speed-grade of the silicon. `sustainable_power` is therefore
simply an estimate, and may be tuned to affect the aggressiveness of
the thermal ramp. For reference, the sustainable power of a 4" phone
is typically 2000mW, while on a 10" tablet is around 4500mW (may vary
depending on screen size).
depending on screen size). It is possible to have the power value
expressed in an abstract scale. The sustained power should be aligned
to the scale used by the related cooling devices.

If you are using device tree, do add it as a property of the
thermal-zone. For example::
Expand Down Expand Up @@ -269,3 +271,11 @@ won't be very good. Note that this is not particular to this
governor, step-wise will also misbehave if you call its throttle()
faster than the normal thermal framework tick (due to interrupts for
example) as it will overreact.

Energy Model requirements
=========================

Another important thing is the consistent scale of the power values
provided by the cooling devices. All of the cooling devices in a single
thermal zone should have power values reported either in milli-Watts
or scaled to the same 'abstract scale'.
30 changes: 25 additions & 5 deletions Documentation/power/energy-model.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,21 @@ possible source of information on its own, the EM framework intervenes as an
abstraction layer which standardizes the format of power cost tables in the
kernel, hence enabling to avoid redundant work.

The power values might be expressed in milli-Watts or in an 'abstract scale'.
Multiple subsystems might use the EM and it is up to the system integrator to
check that the requirements for the power value scale types are met. An example
can be found in the Energy-Aware Scheduler documentation
Documentation/scheduler/sched-energy.rst. For some subsystems like thermal or
powercap power values expressed in an 'abstract scale' might cause issues.
These subsystems are more interested in estimation of power used in the past,
thus the real milli-Watts might be needed. An example of these requirements can
be found in the Intelligent Power Allocation in
Documentation/driver-api/thermal/power_allocator.rst.
Kernel subsystems might implement automatic detection to check whether EM
registered devices have inconsistent scale (based on EM internal flag).
Important thing to keep in mind is that when the power values are expressed in
an 'abstract scale' deriving real energy in milli-Joules would not be possible.

The figure below depicts an example of drivers (Arm-specific here, but the
approach is applicable to any architecture) providing power costs to the EM
framework, and interested clients reading the data from it::
Expand Down Expand Up @@ -73,14 +88,18 @@ Drivers are expected to register performance domains into the EM framework by
calling the following API::

int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *cpus);
struct em_data_callback *cb, cpumask_t *cpus, bool milliwatts);

Drivers must provide a callback function returning <frequency, power> tuples
for each performance state. The callback function provided by the driver is free
to fetch data from any relevant location (DT, firmware, ...), and by any mean
deemed necessary. Only for CPU devices, drivers must specify the CPUs of the
performance domains using cpumask. For other devices than CPUs the last
argument must be set to NULL.
The last argument 'milliwatts' is important to set with correct value. Kernel
subsystems which use EM might rely on this flag to check if all EM devices use
the same scale. If there are different scales, these subsystems might decide
to: return warning/error, stop working or panic.
See Section 3. for an example of driver implementing this
callback, and kernel/power/energy_model.c for further documentation on this
API.
Expand Down Expand Up @@ -156,7 +175,8 @@ EM framework::
37 nr_opp = foo_get_nr_opp(policy);
38
39 /* And register the new performance domain */
40 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);
41
42 return 0;
43 }
40 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
41 true);
42
43 return 0;
44 }
5 changes: 5 additions & 0 deletions Documentation/scheduler/sched-energy.rst
Original file line number Diff line number Diff line change
Expand Up @@ -350,6 +350,11 @@ independent EM framework in Documentation/power/energy-model.rst.
Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.

EAS uses the EM to make a forecasting decision on energy usage and thus it is
more focused on the difference when checking possible options for task
placement. For EAS it doesn't matter whether the EM power values are expressed
in milli-Watts or in an 'abstract scale'.


6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Expand Down
1 change: 0 additions & 1 deletion MAINTAINERS
Original file line number Diff line number Diff line change
Expand Up @@ -11438,7 +11438,6 @@ L: linux-pm@vger.kernel.org
L: linux-tegra@vger.kernel.org
T: git git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux.git
S: Maintained
F: drivers/devfreq/tegra20-devfreq.c
F: drivers/devfreq/tegra30-devfreq.c

MEMORY MANAGEMENT
Expand Down
3 changes: 2 additions & 1 deletion arch/x86/include/asm/msr-index.h
Original file line number Diff line number Diff line change
Expand Up @@ -327,8 +327,9 @@
#define MSR_PP1_ENERGY_STATUS 0x00000641
#define MSR_PP1_POLICY 0x00000642

#define MSR_AMD_PKG_ENERGY_STATUS 0xc001029b
#define MSR_AMD_RAPL_POWER_UNIT 0xc0010299
#define MSR_AMD_CORE_ENERGY_STATUS 0xc001029a
#define MSR_AMD_PKG_ENERGY_STATUS 0xc001029b

/* Config TDP MSRs */
#define MSR_CONFIG_TDP_NOMINAL 0x00000648
Expand Down
Loading

0 comments on commit b4ec805

Please sign in to comment.