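Based on Linux 5.10.
EAS (Energy Aware Scheduling) is part of the CPU scheduler and takes effect when a CPU is selected for a task. Its goal is to save as much power as possible without sacrificing performance. It is built on the Energy Model (EM) framework, a generic interface module that connects the drivers supporting several perf levels with the other parts of the system that want to be aware of energy consumption. EAS, the CPU scheduler and the CPU driver are a typical example: the scheduler wants to know how much energy the underlying CPUs consume so that it can make a better CPU selection. For CPU devices, each cluster has its own independent frequency scaling mechanism, and all CPUs in a cluster run at the same frequency (Qcom changed this so that each CPU can run at its own frequency point). Each cluster therefore forms one performance domain. Through the EM framework interface the scheduler can obtain the energy consumption of a CPU at each performance level.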
1. struct perf_domain
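struct perf_domain {
    struct em_perf_domain *em_pd;
    struct perf_domain *next;   //forms a singly linked list
    struct rcu_head rcu;        //RCU head protecting this list
};
The perf_domain structure represents one CPU performance domain; every performance domain is abstracted by a perf_domain. perf_domain and cpufreq policy map one to one, so a platform with a 4+3+1 layout has 3 perf domains in total. They form a linked list whose head is stored in the global root_domain. The related members of root_domain are listed below:
struct root_domain {
    ...
    int overload;                   //whether this root domain, i.e. the system, is overloaded
    int overutilized;               //whether this root domain, i.e. the system, is overutilized
    unsigned long max_cpu_capacity; //capacity of the most capable CPU in the system
    struct perf_domain __rcu *pd;   //head of the perf_domain singly linked list
};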
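To clarify the two concepts overload and overutilized: on top of the per-CPU overload/overutilized states, the kernel also defines overload and overutilized for the root domain (i.e. the whole system).
(1) A CPU is overloaded when its rq holds two or more tasks, or when it holds a single task that is a misfit task.
(2) A CPU is overutilized when its utilization exceeds its capacity (20% of the capacity is reserved as headroom by default; the capacity here is the capacity available to CFS tasks).
(3) The root domain is overloaded when at least one CPU is overloaded, and overutilized when at least one CPU is overutilized.
The overutilized state is very important because it decides whether the scheduler uses EAS: EAS only takes effect while the system is not overutilized. overload controls how often newidle balance runs: newidle balance is only started when the system is overloaded.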
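The overutilized test in (2) and (3) can be sketched in user-space C as follows. This is a minimal sketch, not the kernel code: the function names are made up for illustration, the per-CPU utilization and capacity values are assumed inputs, and the 1280/1024 ratio is simply one way to express the 20% headroom described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 20% headroom: util fits if util * 1.25 < capacity (1280 / 1024 = 1.25). */
bool fits_capacity_sketch(unsigned long util, unsigned long capacity)
{
    return util * 1280 < capacity * 1024;
}

/* A CPU is overutilized when its utilization no longer fits its capacity. */
bool cpu_overutilized_sketch(unsigned long util, unsigned long capacity)
{
    return !fits_capacity_sketch(util, capacity);
}

/* The root domain is overutilized as soon as one CPU is overutilized. */
bool rd_overutilized_sketch(const unsigned long *util,
                            const unsigned long *capacity, size_t nr_cpus)
{
    size_t i;

    assert(util && capacity);
    for (i = 0; i < nr_cpus; i++)
        if (cpu_overutilized_sketch(util[i], capacity[i]))
            return true;
    return false;
}
```

With a capacity of 1024, a utilization of 800 still fits while 820 does not, so a single CPU at 820 is enough to switch the whole system out of EAS.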
2. struct em_perf_domain
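struct em_perf_domain {
    struct em_perf_state *table; //performance states; must be sorted by ascending frequency, em_cpu_energy() relies on this
    int nr_perf_states;          //number of entries in table
    int milliwatts;              //flag telling whether the power values are in milliwatts or in some other scale
    unsigned long cpus[];        //the CPUs covered by this performance domain
};
A pointer to this structure is stored in the struct device of each CPU, so it can be looked up from any CPU; the CPUs of one performance domain share the same em_perf_domain, as the cpus[] member shows. The EM framework uses em_perf_domain to abstract a performance domain.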
3. struct em_perf_state
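struct em_perf_state {
    unsigned long frequency; //in kHz, consistent with CPUFreq
    unsigned long power;     //in milliwatts; power at this frequency point, may be the total power: static + dynamic
    unsigned long cost;      //cost coefficient of this frequency point, used in the energy computation: power * max_frequency / frequency
};
Each performance domain has several perf levels, and each perf level has a different energy cost. struct em_perf_state describes the energy information of one perf level.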
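1. Overview of the energy computation
Basic formula: energy = power x time
For a CPU the formula has to be refined (the energy consumed while the CPU is idle is left out): energy consumed by the CPU at a frequency point = power of the CPU at that frequency point x time the CPU runs at that frequency point.
The EM records the power of the CPU at every frequency point in em_perf_state::power; these values are computed in advance by the SoC vendor. The running time is expressed through the CPU utilization. One inconvenience is that the CPU utilization is normalized to 1024 and no longer carries the length of time spent at a given frequency point, but it can be converted: running time of the CPU at this frequency point = cpu_util / cpu_current_capacity. Since the energy is computed only to compare candidates, the period is left out.
Capacity of a CPU at a given perf-state (i.e. frequency point): ps->cap = scale_cpu * (ps->freq / cpu_max_freq) ----(1). scale_cpu is the capacity of the CPU at its highest frequency point, scaled to 1024.
Ignoring the power of the idle states, the energy estimate of a CPU at a given perf-state: cpu_nrg = ps->power * (cpu_util / ps->cap) ----(2).
Substituting (1) into (2) gives: cpu_nrg = (ps->power * cpu_max_freq / ps->freq) * (cpu_util / scale_cpu) ----(3).
The first factor of this formula is a constant and is stored in the cost member of em_perf_state, so the energy estimate of a CPU at a given perf-state becomes: cpu_nrg = ps->cost * cpu_util / scale_cpu ----(4).
All CPUs of a perf domain have the same micro-architecture, so they share the same scale_cpu and therefore the same cost. Factoring these out gives the energy formula of the whole perf domain (CPU cluster): pd_nrg = ps->cost * \Sum cpu_util / scale_cpu ----(5).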
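Formulas (1) to (5) can be checked numerically with a small user-space sketch. The function names are made up for illustration, floating point is used only to keep the arithmetic exact in this example, and the sample values (scale_cpu = 1024, 87 mW at 900000 kHz with a maximum of 1800000 kHz) are assumptions chosen for easy arithmetic.

```c
#include <assert.h>

/* Formula (1): capacity of a CPU at a given perf-state. */
double ps_capacity(double scale_cpu, double ps_freq, double cpu_max_freq)
{
    assert(cpu_max_freq > 0);
    return scale_cpu * ps_freq / cpu_max_freq;
}

/* Formula (2): energy estimate from power and the busy-time ratio. */
double cpu_energy_from_power(double ps_power, double cpu_util, double ps_cap)
{
    assert(ps_cap > 0);
    return ps_power * cpu_util / ps_cap;
}

/* The constant first factor of formula (3), stored as ps->cost. */
double ps_cost(double ps_power, double ps_freq, double cpu_max_freq)
{
    assert(ps_freq > 0);
    return ps_power * cpu_max_freq / ps_freq;
}

/* Formula (4): the same energy estimate expressed through cost. */
double cpu_energy_from_cost(double cost, double cpu_util, double scale_cpu)
{
    assert(scale_cpu > 0);
    return cost * cpu_util / scale_cpu;
}

/* Formula (5): energy of a whole perf domain from the summed utilization. */
double pd_energy(double cost, const double *cpu_util, int nr_cpus, double scale_cpu)
{
    double sum_util = 0;
    int i;

    for (i = 0; i < nr_cpus; i++)
        sum_util += cpu_util[i];
    return cpu_energy_from_cost(cost, sum_util, scale_cpu);
}
```

For a CPU with cpu_util = 256, formulas (2) and (4) both give 43.5, which shows that moving the constant part into cost does not change the result.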
1. Building the perf domains
When the CPU topology is initialized, build_perf_domains() creates the perf domains and attaches them to the root domain as its perf domain list.
(1) Implementation
//kernel/sched/topology.c
static bool build_perf_domains(const struct cpumask *cpu_map)
{
    int i, nr_pd = 0, nr_ps = 0, nr_cpus = cpumask_weight(cpu_map);
    struct perf_domain *pd = NULL, *tmp;
    int cpu = cpumask_first(cpu_map);
    struct root_domain *rd = cpu_rq(cpu)->rd;
    bool eas_check = false;

    if (!sysctl_sched_energy_aware) //EAS is not enabled: build nothing and exit
        goto free;
    ...
    for_each_cpu(i, cpu_map) {
        /* Skip already covered CPUs. */
        if (find_pd(pd, i)) //skip CPUs already covered by a pd, so only the first CPU of each cluster goes on
            continue;

        /* Create the new pd and add it to the local list. */
        tmp = pd_init(i);
        tmp->next = pd; //the list ends up with one element per cluster
        pd = tmp;       //head insertion: a cluster probed later is placed at the head, pd points to the head

        /* Count performance domains and performance states for the complexity check. */
        nr_pd++;
        nr_ps += em_pd_nr_perf_states(pd->em_pd); //return pd->nr_perf_states; nr_ps is the sum of the ps counts of all pds
    }

    /* Bail out if the Energy Model complexity is too high. */
    if (nr_pd * (nr_ps + nr_cpus) > EM_MAX_COMPLEXITY) { //2048, the energy model must not be too complex
        WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n", cpumask_pr_args(cpu_map));
        goto free;
    }

    //print the information of all pds, only printed with debug enabled
    perf_domain_debug(cpu_map, pd);

    /* Attach the new list of performance domains to the root domain. */
    tmp = rd->pd;
    rcu_assign_pointer(rd->pd, pd); //the global root_domain::pd points to the head of the perf_domain list
    if (tmp)
        call_rcu(&tmp->rcu, destroy_perf_domain_rcu); //RCU update: root_domain::pd points to the new list, the old one is freed

    pr_info("nr_pd = %d\n", nr_pd); //3 if cpu7 is not isolated, otherwise 2

    return !!pd;

free:
    free_pd(pd);
    tmp = rd->pd;
    rcu_assign_pointer(rd->pd, NULL);
    if (tmp)
        call_rcu(&tmp->rcu, destroy_perf_domain_rcu);

    return false;
}

//error handling is left out of this excerpt
static struct perf_domain *pd_init(int cpu)
{
    struct em_perf_domain *obj = em_cpu_get(cpu);
    struct perf_domain *pd = kzalloc(sizeof(*pd), GFP_KERNEL);

    pd->em_pd = obj; //just points to the em_perf_domain

    return pd;
}

//kernel/power/energy_model.c
struct em_perf_domain *em_cpu_get(int cpu)
{
    struct device *cpu_dev = get_cpu_device(cpu); //return per_cpu(cpu_sys_devices, cpu);

    return em_pd_get(cpu_dev); //return dev->em_pd, stored directly in the device structure of the cpu
}
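The perf-domain list referenced by the pd member of root_domain is built by head insertion, so the pds appear on the list in the order root_domain->pd --> cluster2->pd --> cluster1->pd --> cluster0->pd. A pd is per cluster, not per CPU. A pd is removed from the list only when all CPUs of its cluster are isolated.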
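The head insertion and the complexity check can be reproduced with a small user-space sketch. The struct and function names below are hypothetical and only mimic the shape of the kernel code; the perf-state counts 30, 32 and 28 are the ones reported by the dump later in this article.

```c
#include <assert.h>
#include <stddef.h>

#define EM_MAX_COMPLEXITY_SKETCH 2048

struct pd_node {
    int cluster_id;          /* hypothetical identifier, for illustration only */
    int nr_perf_states;
    struct pd_node *next;
};

/* Head insertion: the new node becomes the list head. */
struct pd_node *pd_prepend(struct pd_node *head, struct pd_node *node)
{
    assert(node);
    node->next = head;
    return node;
}

/* Same shape as the check in build_perf_domains(). */
int em_too_complex(const struct pd_node *head, int nr_cpus)
{
    int nr_pd = 0, nr_ps = 0;

    for (; head; head = head->next) {
        nr_pd++;
        nr_ps += head->nr_perf_states;
    }
    return nr_pd * (nr_ps + nr_cpus) > EM_MAX_COMPLEXITY_SKETCH;
}

/* Builds the list for three clusters probed in the order 0, 1, 2. */
struct pd_node *build_example_list(struct pd_node nodes[3])
{
    struct pd_node *head = NULL;
    int i;

    for (i = 0; i < 3; i++)
        head = pd_prepend(head, &nodes[i]);
    return head;
}
```

Because each node is placed in front of the previous head, the cluster probed last ends up first. For the 4+3+1 platform the complexity is 3 * (90 + 8) = 294, well below the limit of 2048.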
(2) Call paths:
init_cpu_capacity_callback //arch_topology.c runs when the cpu capacity is updated during initialization
schedule_work(&update_topology_flags_work);
init_cpu_capacity_callback //arch_topology.c
update_topology_flags_workfn //arch_topology.c
cpuset_hotplug_workfn //cpuset.c its call path is shown further down
//handler of /proc/sys/kernel/sched_energy_aware
sched_energy_aware_handler //topology.c
rebuild_sched_domains //cpuset.c
pause_cpus //cpu.c called when an error occurs
//cpu.c the .startup.single callback of "sched:active" in cpuhp_hp_states[]
resume_cpus //cpu.c
sched_cpus_activate //core.c
pause_cpus //cpu.c
sched_cpus_deactivate_nosync //core.c
sched_cpu_activate //core.c
cpuset_cpu_active //core.c
//cpu.c the .teardown.single callback of "sched:active" in cpuhp_hp_states[]
resume_cpus //cpu.c
sched_cpus_activate //core.c
sched_cpu_deactivate //core.c
pause_cpus //cpu.c
sched_cpus_deactivate_nosync //core.c
_sched_cpu_deactivate
cpuset_cpu_inactive
cpuset_update_active_cpus //cpuset.c
cpuset_track_online_nodes_nb.notifier_call //cpu.c
cpuset_track_online_nodes //cpuset.c
schedule_work(&cpuset_hotplug_work);
resume_cpus //cpu.c
cpuset_update_active_cpus_affine //cpuset.c
schedule_work_on(cpu, &cpuset_hotplug_work); //runs the work on the given cpu
//write callback of the files /dev/cpuset/[<group>/]cpus and mems
cpuset_write_resmask //cpuset.c
flush_work(&cpuset_hotplug_work); //waits for the queued work to finish, it only flushes
cpuset_hotplug_workfn //work queue handler
rebuild_sched_domains_locked
partition_sched_domains_locked
build_perf_domains //always called with cpu0-6 or cpu0-7, not once per cluster
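The part below cpuset_hotplug_workfn was printed by adding dump_stack(); kernel boot, online/offline and isolate/unisolate all go through the same path. The part above it was found by reading the code.
Every CPU online/offline and isolate/unisolate triggers a rebuild of the domains.
Changing the cpus file of a cgroup group does not trigger the rebuild.
em_pd->cpus only records which CPUs a cluster contains; isolating/unisolating or onlining/offlining a CPU does not change its value.
The global sysctl control variable is sysctl_sched_energy_aware, and the corresponding control file is /proc/sys/kernel/sched_energy_aware.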
//kernel/sched/topology.c
int sched_energy_aware_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos)
{
    int ret, state;

    if (write && !capable(CAP_SYS_ADMIN))
        return -EPERM;

    ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
    if (!ret && write) {
        state = static_branch_unlikely(&sched_energy_present);
        if (state != sysctl_sched_energy_aware) {
            mutex_lock(&sched_energy_mutex);
            sched_energy_update = 1; //only used in partition_sched_domains_locked
            rebuild_sched_domains();
            sched_energy_update = 0;
            mutex_unlock(&sched_energy_mutex);
        }
    }

    return ret;
}
Updating sched_energy_present:
/*
 * kernel/sched/topology.c
 * partition_sched_domains_locked --> sched_energy_set
 */
static void sched_energy_set(bool has_eas)
{
    if (!has_eas && static_branch_unlikely(&sched_energy_present)) {
        static_branch_disable_cpuslocked(&sched_energy_present);
    } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
        static_branch_enable_cpuslocked(&sched_energy_present);
    }
}
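sched_energy_enabled() tests this static key. It is used in two places:
(1) In the load balancing path, find_busiest_group() aborts the current balance if EAS is enabled and the system is not overutilized.
(2) In the task placement path, select_task_rq_fair() calls find_energy_efficient_cpu() for EAS CPU selection only if EAS is enabled.
1. Conditions for entering EAS CPU selection on wakeup
A blocked task is woken up when an asynchronous event or another thread calls try_to_wake_up(). After the wakeup comes task placement, i.e. selecting a CPU for the woken task. If EAS is enabled, EAS selection is tried first. EAS is only used under light load (the system is not overutilized); under heavy load (as soon as one CPU is overutilized) the traditional kernel algorithm selects the CPU.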
static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
    int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
    ...
    trace_android_rvh_select_task_rq_fair(p, prev_cpu, sd_flag, wake_flags, &target_cpu);
    if (target_cpu >= 0)
        return target_cpu;
    ...
    //only the wakeup path can take the EAS selection path
    if (sd_flag & SD_BALANCE_WAKE) {
        //the global sysctl controls whether EAS is enabled
        if (sched_energy_enabled()) {
            new_cpu = find_energy_efficient_cpu(p, prev_cpu, sync);
            if (new_cpu >= 0)
                return new_cpu; //whenever EAS picks a CPU, its result is used
        }
    }
    ...
}
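As the code above shows, EAS is only used for wakeups; fork and exec balancing do not go through the EAS selection algorithm. find_energy_efficient_cpu() is the main EAS selection path, and two conditions must hold to use it: the task is on the wakeup path and the EAS feature is enabled. If EAS finds a suitable CPU, that CPU is returned directly. If EAS selection fails, the default CPU is restored to prev cpu and the traditional selection path picks the CPU again.
2. Details of EAS CPU selection
The details of EAS CPU selection are in find_energy_efficient_cpu(), shown below: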
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu, int sync)
{
    unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
    struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
    int max_spare_cap_cpu_ls = prev_cpu, best_idle_cpu = -1;
    unsigned long max_spare_cap_ls = 0, target_cap;
    unsigned long cpu_cap, util, base_energy = 0;
    bool boosted, latency_sensitive = false;
    unsigned int min_exit_lat = UINT_MAX;
    int cpu, best_energy_cpu = prev_cpu;
    struct cpuidle_state *idle;
    struct sched_domain *sd;
    struct perf_domain *pd;
    int new_cpu = INT_MAX;

    //update the load of the task
    sync_entity_load_avg(&p->se);

    //vendors or ODMs may register a hook and bypass this function
    trace_android_rvh_find_energy_efficient_cpu(p, prev_cpu, sync, &new_cpu); //hook
    if (new_cpu != INT_MAX)
        return new_cpu;

    rcu_read_lock();
    //get the perf_domain list from rd
    pd = rcu_dereference(rd->pd);
    if (!pd || READ_ONCE(rd->overutilized)) //leave EAS selection as soon as one cpu in the system is overutilized
        goto fail;

    cpu = smp_processor_id(); //the cpu we are running on
    //for a sync wakeup: if the current cpu runs only the waker, the woken task is allowed to run on this cpu,
    //and the capacity of this cpu satisfies the woken task, pick the current cpu directly
    if (sync && cpu_rq(cpu)->nr_running == 1 && cpumask_test_cpu(cpu, p->cpus_ptr) &&
        task_fits_capacity(p, capacity_of(cpu))) { //the uclamp'ed util is below 80% of the CPU capacity
        rcu_read_unlock();
        return cpu;
    }

    /* Energy-aware wake-up happens on the lowest sched_domain starting from sd_asym_cpucapacity spanning over this_cpu and prev_cpu. */
    sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity)); //returns the DIE level sd of this cpu
    //if prev_cpu is a valid cpu id, this check is redundant on a phone
    while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
        sd = sd->parent;
    if (!sd)
        goto fail;

    //max(util, util_est): if the util of task p is 0, goto unlock returns prev_cpu directly
    if (!task_util_est(p))
        goto unlock;

    //whether the cgroup of the woken task p has the cpu.uclamp.latency_sensitive flag set
    latency_sensitive = uclamp_latency_sensitive(p);
    //whether the uclamp min of task p is still above 0 after the global and cgroup limits
    boosted = uclamp_boosted(p);
    target_cap = boosted ? 0 : ULONG_MAX; //initial value chosen to match how it is used below

    //traversal starts with the big cores: big --> middle --> little. The order does not matter,
    //since the decision is made only after all pds have been traversed
    for (; pd; pd = pd->next) {
        //the variables in the loop body are fresh for every pd
        unsigned long cur_delta, spare_cap, max_spare_cap = 0;
        unsigned long base_energy_pd;
        int max_spare_cap_cpu = -1;

        /* Compute the 'base' energy of the pd, without @p */
        //energy of this pd without p, used as the base energy. dst_cpu is -1, so the util of p
        //is also subtracted from the cpu it ran on before
        base_energy_pd = compute_energy(p, -1, pd);
        //total energy of the system without p
        base_energy += base_energy_pd;

        /*
         * Note that there is no check for active cpus here! pd->em_pd->cpus only records which cpus a cluster
         * contains; offline cpus are removed from sd->span, but isolated ones are not. The 'base' energy
         * computed above may have the same problem.
         */
        for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
            if (!cpumask_test_cpu(cpu, p->cpus_ptr)) //filter out the cpus p is not allowed to run on
                continue;

            util = cpu_util_next(cpu, p, cpu); //util of this cpu after p is placed on it
            cpu_cap = capacity_of(cpu);
            spare_cap = cpu_cap;
            lsub_positive(&spare_cap, util); //capacity left on this cpu after p is placed on it

            /*
             * Skip CPUs that cannot satisfy the capacity request. IOW, placing the task there would make the CPU
             * overutilized. Take uclamp into account to see how much capacity we can get out of the CPU; this is
             * aligned with schedutil_cpu_util().
             */
            //clamp util with uclamp; if the capacity of the cpu no longer satisfies it, stop examining this cpu
            util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
            if (!fits_capacity(util, cpu_cap))
                continue;

            /* Always use prev_cpu as a candidate. */
            if (!latency_sensitive && cpu == prev_cpu) { //not latency sensitive, and this is the cpu the task ran on before
                prev_delta = compute_energy(p, prev_cpu, pd); //energy of the whole pd with p on prev_cpu
                prev_delta -= base_energy_pd;                 //energy increase of the pd with p on prev_cpu
                best_delta = min(best_delta, prev_delta);     //keep the smaller value
            }

            /*
             * Find the CPU with the maximum spare capacity in the performance domain
             */
            //remember the cpu with the most capacity left after placing p, and that capacity
            if (spare_cap > max_spare_cap) {
                max_spare_cap = spare_cap;
                max_spare_cap_cpu = cpu;
            }

            if (!latency_sensitive) //not latency sensitive: nothing more to examine for this cpu
                continue;

            /*--- everything below only runs for latency sensitive tasks ---*/
            if (idle_cpu(cpu)) {
                cpu_cap = capacity_orig_of(cpu);
                //boosted: target_cap starts at 0, so prefer CPUs with a larger capacity
                if (boosted && cpu_cap < target_cap)
                    continue;
                //not boosted: target_cap starts at ULONG_MAX, so prefer CPUs with a smaller capacity
                if (!boosted && cpu_cap > target_cap)
                    continue;
                idle = idle_get_state(cpu_rq(cpu)); //return rq->idle_state;
                //with equal capacity, prefer the smallest idle exit latency. Changing the exit_latency
                //comparison to ">=" would favour the first CPU of a cluster
                if (idle && idle->exit_latency > min_exit_lat && cpu_cap == target_cap)
                    continue;

                if (idle) //the check only avoids a NULL dereference; this records the exit latency of the chosen idle cpu, not necessarily the minimum
                    min_exit_lat = idle->exit_latency;
                target_cap = cpu_cap; //capacity of the idle cpu
                best_idle_cpu = cpu;  //the idle cpu considered best so far
            } else if (spare_cap > max_spare_cap_ls) { //latency sensitive and the cpu is not idle
                max_spare_cap_ls = spare_cap; //largest spare capacity
                max_spare_cap_cpu_ls = cpu;   //cpu with the largest spare capacity
            }
        }

        /*--- handling after all cpus of one cluster have been traversed ---*/
        /* Evaluate the energy impact of using this CPU.*/
        if (!latency_sensitive && max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
            //energy increase of this pd with p on the cpu of this cluster that has the most spare capacity;
            //compare it with all other candidates and keep the smaller one
            cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
            cur_delta -= base_energy_pd;
            if (cur_delta < best_delta) {
                best_delta = cur_delta;
                best_energy_cpu = max_spare_cap_cpu;
            }
        }
    }

    //the traversal is finished here
unlock:
    rcu_read_unlock();

    if (latency_sensitive)
        return best_idle_cpu >= 0 ? best_idle_cpu : max_spare_cap_cpu_ls;

    /*
     * Pick the best CPU if prev_cpu cannot be used, or if it saves at least 6% of the energy used by prev_cpu.
     */
    if (prev_delta == ULONG_MAX)
        return best_energy_cpu;

    //the energy increase on prev_cpu minus the energy increase on the best candidate exceeds
    //6.25% (1/16) of the total energy with task p on prev_cpu
    if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
        return best_energy_cpu;

    return prev_cpu;

fail:
    rcu_read_unlock();

    return -1;
}
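In the native logic the whole traversal makes at most 2 * nr_cluster + 1 compute_energy() calls: the base energy of each cluster without task p, the energy with p on prev_cpu, and the energy of each cluster with p on that cluster's CPU with the most spare capacity. The main comparison happens in the non-latency_sensitive case: task p is placed on the CPU with the most spare capacity of each cluster, the results are compared, and the max-spare-capacity CPU with the smallest energy increase becomes the candidate.
The selection logic of this function can be summarized as follows:
(1) If task p is latency_sensitive, best_idle_cpu is returned if it exists, otherwise the CPU with the most spare capacity is returned. best_idle_cpu is chosen as follows:
a. The CPU must be idle.
b. If task p has a uclamp min value it counts as boosted, and CPUs with a larger capacity are preferred; otherwise CPUs with a smaller capacity are preferred.
c. Among CPUs with the same capacity, the one with the shorter exit latency, i.e. the shallower idle state, is chosen.
The CPU with the most spare capacity is chosen as follows:
a. The CPU must not be idle.
b. It is the CPU with the most spare capacity after task p is placed on it.
(2) If task p is not latency_sensitive and prev_cpu cannot be used (the affinity of p does not allow prev_cpu, or the capacity left on prev_cpu cannot hold p), best_energy_cpu is returned directly.
best_energy_cpu is chosen as follows:
a. It defaults to prev_cpu.
b. The CPUs with the most spare capacity of each cluster compete with each other: the CPU with the smallest energy increase after task p is placed on it wins.
(3) If task p is not latency_sensitive, prev_cpu can be used, and the difference between the energy increase on prev_cpu and on best_energy_cpu is at most 6.25% of the energy consumed with task p on prev_cpu, prev_cpu is chosen. The energy saving is limited in that case, and choosing prev_cpu reduces cache misses. A possible optimization is to choose prev_cpu only when cache_hot() also holds.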
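The final decision of cases (2) and (3) can be sketched as a standalone function. The name pick_energy_cpu() and its parameter list are made up for illustration; the comparison itself is the one used at the end of find_energy_efficient_cpu().

```c
#include <assert.h>
#include <limits.h>

/*
 * prev_delta:  energy increase with p on prev_cpu, ULONG_MAX if prev_cpu
 *              was not a usable candidate
 * best_delta:  smallest energy increase found among all candidates
 * base_energy: energy of the system without p
 */
int pick_energy_cpu(unsigned long prev_delta, unsigned long best_delta,
                    unsigned long base_energy, int prev_cpu, int best_energy_cpu)
{
    /* Case (2): prev_cpu cannot be used. */
    if (prev_delta == ULONG_MAX)
        return best_energy_cpu;

    /* best_delta was min()'ed with prev_delta, so it cannot exceed it. */
    assert(best_delta <= prev_delta);

    /* Case (3): migrate only if more than 1/16 (6.25%) is saved. */
    if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
        return best_energy_cpu;

    return prev_cpu;
}
```

With base_energy = 1500 and prev_delta = 100 the threshold is 1600 >> 4 = 100, so a saving of 10 keeps the task on prev_cpu. With prev_delta = 300 and best_delta = 100 the saving of 200 exceeds 1800 >> 4 = 112 and the task moves to best_energy_cpu.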
compute_energy(), used by this function:
/*
 * Purpose: compute the energy of the whole pd, i.e. this cluster, after task p is migrated to dst_cpu.
 * If dst_cpu is -1, the result is the energy of this pd when task p runs on none of its cpus, i.e. the base energy.
 */
static long compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
    struct cpumask *pd_mask = perf_domain_span(pd);
    unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask)); //return per_cpu(cpu_scale, cpu); capacity of this cpu
    unsigned long max_util = 0, sum_util = 0;
    unsigned long energy = 0;
    int cpu;

    //run for every online cpu of this pd
    for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
        //util of each cpu of this pd if p runs on dst_cpu
        unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
        struct task_struct *tsk = cpu == dst_cpu ? p : NULL; //note: NULL for every cpu when dst_cpu is -1

        //sum of the capacity used by cfs+irq+rt+dl. Note that ENERGY_UTIL is passed here
        sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap, ENERGY_UTIL, NULL);

        //this computation takes uclamp into account, so util is most likely clamped upwards;
        //dl is also handled differently, its bandwidth is used here
        cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap, FREQUENCY_UTIL, tsk); //whether tsk is NULL only affects the clamp range

        //maximum cpu_util over all cpus of this pd
        max_util = max(max_util, cpu_util);
    }

    energy = em_cpu_energy(pd->em_pd, max_util, sum_util); //energy of the whole pd

    return energy;
}
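em_cpu_energy() estimates the energy from the sum of the util of all CPUs of the cluster, and uses the util of the busiest CPU to determine the frequency point the cluster would run at.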
/*
 * em_cpu_energy() - Estimates the energy consumed by the CPUs of a performance domain
 * @pd       : performance domain for which energy has to be estimated
 * @max_util : highest utilization among CPUs of the domain
 * @sum_util : sum of the utilization of all CPUs in the domain
 */
/*
 * Purpose: compute the energy of the pd. max_util selects the frequency of this cluster,
 * sum_util is used to compute the energy of this cluster, i.e. the pd.
 */
static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, unsigned long max_util, unsigned long sum_util)
{
    unsigned long freq, scale_cpu;
    struct em_perf_state *ps;
    int i, cpu;

    if (!sum_util)
        return 0;

    cpu = cpumask_first(to_cpumask(pd->cpus));
    scale_cpu = arch_scale_cpu_capacity(cpu); //capacity of the cpus of this pd
    ps = &pd->table[pd->nr_perf_states - 1];  //the table is in ascending order, so this is the highest perf-state
    freq = map_util_freq(max_util, ps->frequency, scale_cpu); //return (freq + (freq >> 2)) * util / cap = 1.25 * (util / cap) * max_freq;

    /*
     * Find the lowest performance state of the Energy Model above the requested frequency.
     */
    //find the first em_perf_state whose frequency is greater than or equal to the computed freq
    for (i = 0; i < pd->nr_perf_states; i++) {
        ps = &pd->table[i];
        if (ps->frequency >= freq)
            break;
    }

    /*
     * The capacity of a CPU in the domain at the performance state (ps)
     * can be computed as:
     *
     *             ps->freq * scale_cpu
     *   ps->cap = --------------------                          (1)
     *                 cpu_max_freq
     *
     * So, ignoring the costs of idle states (which are not available in
     * the EM), the energy consumed by this CPU at that performance state
     * is estimated as:
     *
     *             ps->power * cpu_util
     *   cpu_nrg = --------------------                          (2)
     *                   ps->cap
     *
     * since 'cpu_util / ps->cap' represents its percentage of busy time.
     *
     *   NOTE: Although the result of this computation actually is in
     *         units of power, it can be manipulated as an energy value
     *         over a scheduling period, since it is assumed to be
     *         constant during that interval.
     *
     * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
     * of two terms:
     *
     *             ps->power * cpu_max_freq   cpu_util
     *   cpu_nrg = ------------------------ * ---------          (3)
     *                    ps->freq            scale_cpu
     *
     * The first term is static, and is stored in the em_perf_state struct
     * as 'ps->cost'.
     *
     * Since all CPUs of the domain have the same micro-architecture, they
     * share the same 'ps->cost', and the same CPU capacity. Hence, the
     * total energy of the domain (which is the simple sum of the energy of
     * all of its CPUs) can be factorized as:
     *
     *            ps->cost * \Sum cpu_util
     *   pd_nrg = ------------------------                       (4)
     *                  scale_cpu
     */
    return ps->cost * sum_util / scale_cpu; //the formula derived earlier: the energy of the whole pd
}
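To see the two steps of em_cpu_energy() with concrete numbers, here is a user-space sketch of the same logic. The struct and function names are made up, the table holds only four of the little-cluster entries from the dump later in this article, and scale_cpu = 1024 is an assumption made for the example (the real capacity of these cores is not given here).

```c
#include <assert.h>
#include <stddef.h>

struct ps_sketch {
    unsigned long frequency; /* kHz */
    unsigned long power;     /* mW */
    unsigned long cost;
};

/* Four little-cluster entries (indexes 0, 8, 20 and 29 of the dump). */
const struct ps_sketch little_table[] = {
    {  200000,  14, 126000 },
    {  600000,  53, 159000 },
    { 1200000, 132, 198000 },
    { 1800000, 243, 243000 },
};

/* 1.25 * (util / cap) * max_freq, in integer arithmetic. */
unsigned long map_util_freq_sketch(unsigned long util, unsigned long max_freq,
                                   unsigned long cap)
{
    assert(cap > 0);
    return (max_freq + (max_freq >> 2)) * util / cap;
}

unsigned long em_cpu_energy_sketch(const struct ps_sketch *table, size_t nr_ps,
                                   unsigned long scale_cpu,
                                   unsigned long max_util, unsigned long sum_util)
{
    const struct ps_sketch *ps;
    unsigned long freq;
    size_t i;

    assert(table && nr_ps > 0 && scale_cpu > 0);
    if (!sum_util)
        return 0;

    /* Step 1: max_util decides the frequency the domain would request. */
    ps = &table[nr_ps - 1];
    freq = map_util_freq_sketch(max_util, ps->frequency, scale_cpu);

    /* Lowest perf-state at or above that frequency, else the highest one. */
    for (i = 0; i < nr_ps; i++) {
        ps = &table[i];
        if (ps->frequency >= freq)
            break;
    }

    /* Step 2: sum_util decides the energy at that perf-state. */
    return ps->cost * sum_util / scale_cpu;
}

/* Convenience wrapper using the sample table and the assumed scale_cpu. */
unsigned long little_cluster_energy(unsigned long max_util, unsigned long sum_util)
{
    return em_cpu_energy_sketch(little_table,
                                sizeof(little_table) / sizeof(little_table[0]),
                                1024, max_util, sum_util);
}
```

With max_util = 300 the requested frequency is 659179 kHz, which selects the 1200000 kHz entry of this reduced table; with sum_util = 500 the estimate is 198000 * 500 / 1024 = 96679.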
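cpu_util_next() computes the util of each cpu of this pd if task p were placed on dst_cpu. Walking over every cpu of the pd yields the sum_util of the pd, from which the energy of the pd is computed.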
/*
 * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued) to @dst_cpu.
 * Purpose: predict the util of cpu after task p is migrated to dst_cpu
 */
//compute_energy passes (cpu, p, dst_cpu) where cpu is one cpu of this pd. For the base energy dst_cpu is -1,
//in which case the util of p can only be subtracted, never added
static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
{
    struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
    unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg); //util of the cfs_rq; it is only read, never written back

    /*
     * If @p migrates from @cpu to another, remove its contribution. Or,
     * if @p migrates from another CPU to @cpu, add its contribution. In
     * the other cases, @cpu is not impacted by the migration, so the
     * util_avg should already be correct.
     */
    //this cpu is the cpu p ran on before, but not the cpu p is going to run on
    if (task_cpu(p) == cpu && dst_cpu != cpu)
        sub_positive(&util, task_util(p)); //subtract the util of p from the util of the cfs_rq
    //this cpu is not the cpu p ran on before, but it is the cpu p is going to run on
    else if (task_cpu(p) != cpu && dst_cpu == cpu)
        util += task_util(p); //add the util of p to the util of the cfs_rq

    if (sched_feat(UTIL_EST)) {
        util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);

        /*
         * During wake-up, the task isn't enqueued yet and doesn't
         * appear in the cfs_rq->avg.util_est.enqueued of any rq,
         * so just add it (if needed) to "simulate" what will be
         * cpu_util() after the task has been enqueued.
         */
        //this cpu is the cpu task p is going to run on
        if (dst_cpu == cpu)
            util_est += _task_util_est(p);

        util = max(util, util_est);
    }

    //util of the cfs_rq after the adjustments, capped at the original capacity
    return min(util, capacity_orig_of(cpu));
}

/*
 * compute_energy calls the next function twice for every cpu of the pd:
 * (cpu, util_cfs, cpu_cap, ENERGY_UTIL, NULL): util_cfs is the util of this cpu after task p is migrated
 * to dst_cpu, cpu_cap is the capacity of a single cpu of this pd.
 * (cpu, util_cfs, cpu_cap, FREQUENCY_UTIL, tsk): tsk is p only for dst_cpu and NULL for the other cpus.
 *
 * To save space, many comments were removed from the functions.
 */
/*
 * This function computes an effective utilization for the given CPU, to be
 * used for frequency selection given the linear relation: f = u * f_max.
 */
//Purpose: compute the effective util of the cpu
unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs, unsigned long max,
                                 enum schedutil_type type, struct task_struct *p)
{
    unsigned long dl_util, util, irq;
    struct rq *rq = cpu_rq(cpu);

    if (!uclamp_is_used() && type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
        return max;
    }

    irq = cpu_util_irq(rq);
    if (unlikely(irq >= max))
        return max;

    util = util_cfs + cpu_util_rt(rq); //return rq->avg_rt.util_avg
    if (type == FREQUENCY_UTIL) //uclamp is applied only for FREQUENCY_UTIL, not for the energy computation
        util = uclamp_rq_util_with(rq, util, p);

    dl_util = cpu_util_dl(rq); //return rq->avg_dl.util_avg

    if (util + dl_util >= max) //CFS+RT+DL already exceed the capacity of the cpu
        return max;

    /*
     * OTOH, for energy computation we need the estimated running time, so
     * include util_dl and ignore dl_bw.
     */
    if (type == ENERGY_UTIL)
        util += dl_util;

    util = scale_irq_capacity(util, irq, max);
    util += irq; //util = util * (1 - irq/max) + irq

    if (type == FREQUENCY_UTIL)
        util += cpu_bw_dl(rq);

    return min(max, util); //effective util of the cpu covering cfs+irq+rt+dl
}
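The add/subtract rule of cpu_util_next() can be tried out with plain numbers. The sketch below is a simplification under stated assumptions: it takes the run queue util and the task util as parameters instead of reading them from kernel structures, it leaves out the UTIL_EST branch, and the function name is made up.

```c
#include <assert.h>

/*
 * cpu:           the cpu whose util is predicted
 * task_cpu:      the cpu the task ran on before
 * dst_cpu:       the cpu the task would be placed on, -1 for "nowhere"
 * rq_util:       current util of cpu, task_util: util of the task
 * capacity_orig: original capacity of cpu, used as the upper bound
 */
unsigned long cpu_util_next_sketch(int cpu, int task_cpu, int dst_cpu,
                                   unsigned long rq_util, unsigned long task_util,
                                   unsigned long capacity_orig)
{
    unsigned long util = rq_util;

    assert(capacity_orig > 0);

    if (task_cpu == cpu && dst_cpu != cpu)
        /* The task leaves this cpu: subtract, but never go below zero. */
        util = util > task_util ? util - task_util : 0;
    else if (task_cpu != cpu && dst_cpu == cpu)
        /* The task arrives on this cpu: add its contribution. */
        util += task_util;

    return util < capacity_orig ? util : capacity_orig;
}
```

For the base energy (dst_cpu = -1) only the first branch can trigger, so the previous cpu of the task drops from 400 to 250 for a task util of 150, while every other cpu keeps its value.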
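Note the hooks in the code: vendors may change the behaviour through them so that the native EAS selection logic is not executed.
1. A program for debugging the perf_domain list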
/* place this file under kernel/sched */
#define pr_fmt(fmt) "perf_domain_debug: " fmt

#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/string.h>
#include <linux/printk.h>
#include <linux/uaccess.h> /* copy_from_user() */
#include <asm/topology.h>
#include <linux/cpumask.h>
#include <linux/sched/topology.h>
#include "sched.h"

struct perf_domain_debug_t {
    int cmd;
};

static struct perf_domain_debug_t pdd;

/* print one perf domain and its perf-state table */
static void perf_domain_debug(struct seq_file *m, struct perf_domain *pd)
{
    int i;
    struct em_perf_domain *em_pd = pd->em_pd;

    seq_printf(m, "em_pd->nr_perf_states=%d, em_pd->milliwatts=%d, em_pd->cpus==%*pbl \n",
            em_pd->nr_perf_states, em_pd->milliwatts, cpumask_pr_args(to_cpumask(em_pd->cpus)));

    for (i = 0; i < em_pd->nr_perf_states; i++) {
        seq_printf(m, "[%d]: frequency=%lu, power=%lu, cost=%lu\n",
                i, em_pd->table[i].frequency, em_pd->table[i].power, em_pd->table[i].cost);
    }

    seq_printf(m, "-------------------------------------------------------------------\n");
}

static int perf_domain_debug_show(struct seq_file *m, void *v)
{
    struct root_domain *rd = cpu_rq(0)->rd;
    struct perf_domain *pd;

    /* rd->pd is an RCU protected list, so walk it under rcu_read_lock() */
    rcu_read_lock();
    for (pd = rcu_dereference(rd->pd); pd; pd = pd->next)
        perf_domain_debug(m, pd);
    rcu_read_unlock();

    return 0;
}

static int perf_domain_debug_open(struct inode *inode, struct file *file)
{
    return single_open(file, perf_domain_debug_show, NULL);
}

static ssize_t perf_domain_debug_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
{
    int ret, cmd_value;
    char buffer[32] = {0};

    if (count >= sizeof(buffer)) {
        count = sizeof(buffer) - 1;
    }
    if (copy_from_user(buffer, buf, count)) {
        pr_info("copy_from_user failed\n");
        return -EFAULT;
    }
    ret = sscanf(buffer, "%d", &cmd_value);
    if (ret <= 0) {
        pr_info("sscanf dec failed\n");
        return -EINVAL;
    }
    pr_info("cmd_value=%d\n", cmd_value);

    pdd.cmd = cmd_value;

    return count;
}

//Linux5.10 change file_operations to proc_ops
static const struct proc_ops perf_domain_debug_fops = {
    .proc_open    = perf_domain_debug_open,
    .proc_read    = seq_read,
    .proc_write   = perf_domain_debug_write,
    .proc_lseek   = seq_lseek,
    .proc_release = single_release,
};

static int __init perf_domain_debug_init(void)
{
    proc_create("perf_domain_debug", S_IRUGO | S_IWUGO, NULL, &perf_domain_debug_fops);

    pr_info("probed\n");

    return 0;
}
fs_initcall(perf_domain_debug_init);
2. Test results
# cat /proc/perf_domain_debug
em_pd->nr_perf_states=28, em_pd->milliwatts=1, em_pd->cpus==7
[0]: frequency=1300000, power=308, cost=722615
[1]: frequency=1400000, power=353, cost=769035
[2]: frequency=1500000, power=393, cost=799100
[3]: frequency=1600000, power=444, cost=846375
[4]: frequency=1700000, power=490, cost=879117
[5]: frequency=1800000, power=538, cost=911611
[6]: frequency=1900000, power=588, cost=943894
[7]: frequency=2000000, power=651, cost=992775
[8]: frequency=2050000, power=691, cost=1028073
[9]: frequency=2100000, power=732, cost=1063142
[10]: frequency=2150000, power=785, cost=1113604
[11]: frequency=2200000, power=830, cost=1150681
[12]: frequency=2250000, power=876, cost=1187466
[13]: frequency=2300000, power=922, cost=1222652
[14]: frequency=2350000, power=971, cost=1260234
[15]: frequency=2400000, power=1020, cost=1296250
[16]: frequency=2450000, power=1088, cost=1354448
[17]: frequency=2500000, power=1144, cost=1395680
[18]: frequency=2550000, power=1198, cost=1432901
[19]: frequency=2600000, power=1239, cost=1453442
[20]: frequency=2650000, power=1299, cost=1495075
[21]: frequency=2700000, power=1340, cost=1513703
[22]: frequency=2750000, power=1403, cost=1556054
[23]: frequency=2800000, power=1448, cost=1577285
[24]: frequency=2850000, power=1511, cost=1617035
[25]: frequency=2900000, power=1559, cost=1639637
[26]: frequency=3000000, power=1674, cost=1701900
[27]: frequency=3050000, power=1746, cost=1746000
-------------------------------------------------------------------
em_pd->nr_perf_states=32, em_pd->milliwatts=1, em_pd->cpus==4-6
[0]: frequency=200000, power=21, cost=299250
[1]: frequency=300000, power=31, cost=294500
[2]: frequency=400000, power=41, cost=292125
[3]: frequency=500000, power=55, cost=313500
[4]: frequency=600000, power=70, cost=332500
[5]: frequency=700000, power=87, cost=354214
[6]: frequency=800000, power=104, cost=370500
[7]: frequency=900000, power=125, cost=395833
[8]: frequency=1000000, power=145, cost=413250
[9]: frequency=1100000, power=169, cost=437863
[10]: frequency=1200000, power=192, cost=456000
[11]: frequency=1300000, power=215, cost=471346
[12]: frequency=1400000, power=245, cost=498750
[13]: frequency=1500000, power=272, cost=516800
[14]: frequency=1600000, power=300, cost=534375
[15]: frequency=1700000, power=335, cost=561617
[16]: frequency=1800000, power=379, cost=600083
[17]: frequency=1900000, power=420, cost=630000
[18]: frequency=2000000, power=470, cost=669750
[19]: frequency=2050000, power=496, cost=689560
[20]: frequency=2100000, power=523, cost=709785
[21]: frequency=2150000, power=543, cost=719790
[22]: frequency=2200000, power=572, cost=741000
[23]: frequency=2250000, power=602, cost=762533
[24]: frequency=2300000, power=623, cost=771978
[25]: frequency=2350000, power=645, cost=782234
[26]: frequency=2400000, power=666, cost=790875
[27]: frequency=2450000, power=690, cost=802653
[28]: frequency=2550000, power=736, cost=822588
[29]: frequency=2650000, power=783, cost=842094
[30]: frequency=2750000, power=832, cost=862254
[31]: frequency=2850000, power=880, cost=880000
-------------------------------------------------------------------
em_pd->nr_perf_states=30, em_pd->milliwatts=1, em_pd->cpus==0-3
[0]: frequency=200000, power=14, cost=126000
[1]: frequency=250000, power=19, cost=136800
[2]: frequency=300000, power=23, cost=138000
[3]: frequency=350000, power=28, cost=144000
[4]: frequency=400000, power=32, cost=144000
[5]: frequency=450000, power=37, cost=148000
[6]: frequency=500000, power=43, cost=154800
[7]: frequency=550000, power=47, cost=153818
[8]: frequency=600000, power=53, cost=159000
[9]: frequency=650000, power=59, cost=163384
[10]: frequency=700000, power=63, cost=162000
[11]: frequency=750000, power=70, cost=168000
[12]: frequency=800000, power=76, cost=171000
[13]: frequency=850000, power=81, cost=171529
[14]: frequency=900000, power=87, cost=174000
[15]: frequency=950000, power=94, cost=178105
[16]: frequency=1000000, power=99, cost=178200
[17]: frequency=1050000, power=108, cost=185142
[18]: frequency=1100000, power=115, cost=188181
[19]: frequency=1150000, power=125, cost=195652
[20]: frequency=1200000, power=132, cost=198000
[21]: frequency=1250000, power=140, cost=201600
[22]: frequency=1300000, power=150, cost=207692
[23]: frequency=1350000, power=158, cost=210666
[24]: frequency=1400000, power=166, cost=213428
[25]: frequency=1450000, power=177, cost=219724
[26]: frequency=1500000, power=185, cost=222000
[27]: frequency=1600000, power=205, cost=230625
[28]: frequency=1700000, power=222, cost=235058
[29]: frequency=1800000, power=243, cost=243000
-------------------------------------------------------------------
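If cpu7 is isolated, cluster3 is no longer present. Order of the pd singly linked list: cluster3 --> cluster2 --> cluster1.
ps->cost = ps->power * cpu_max_freq / ps->freq. For the first frequency point of the little cores this gives 14 * 1800000 / 200000 = 126, but the dumped value is cost=126000, so the value is apparently multiplied by 1000. Entry [1] gives 19 * 1800000 / 250000 = 136.8 against a dumped 136800, which suggests that the scaling is applied before the integer division.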
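The factor of 1000 can be checked against several entries of the dump with a few lines of C. This is a sketch under an assumption: the scale factor is inferred from the dumped numbers only, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Factor inferred from the dump above; an assumption, not a kernel constant. */
#define POWER_SCALE_SKETCH 1000ULL

/* cost = power * scale * max_freq / freq, scaled before the integer division. */
uint64_t em_cost_sketch(uint64_t power_mw, uint64_t freq_khz, uint64_t max_freq_khz)
{
    assert(freq_khz > 0);
    return power_mw * POWER_SCALE_SKETCH * max_freq_khz / freq_khz;
}
```

The function reproduces the first entry of each cluster in the dump (126000, 299250 and 722615) as well as entries such as 136800, whose unscaled value is not an integer.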