
Analyzing a Crypto Miner's CPU Instruction Data with Intel VTune Profiler


Monero mining command:

Collection and Platform Info
    Application Command Line:    D:\share\xmrig-6.18.0-msvc-win64\xmrig-6.18.0\xmrig.exe -o fr.minexmr.com:443 -u 4971qQbWrJRUGDvEUUvqsw29MNz68Cus7d6DAsmTmGoZd4o9AL9FAJiFSvo5uZK1ezguR46n689Rk3zApMZTcB3gQfDMULX -p x --tls
    Operating System:    Microsoft Windows 10
    Computer Name:    DESKTOP-ALRVTLS
    Result Size:    1.7 GB (size of the full collected data set)
    Collection start time:    15:29:48 02/08/2022 UTC
    Collection stop time:    15:32:55 02/08/2022 UTC
    Collector Type:    Event-based sampling driver
    Finalization mode: Fast. If the number of collected samples exceeds the threshold, this mode limits the number of processed samples to speed up post-processing.
    CPU
        Name:    Intel(R) microarchitecture code named Rocketlake
        Frequency:    2.6 GHz
        Logical CPU Count:    12
        Cache Allocation Technology
            Level 2 capability:    not detected
            Level 3 capability:    not detected

Analysis type:

 

Screenshot of the run:

 


 

 

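After running for nearly 2 minutes (the collection timestamps above span about 3 minutes), let's look at the results: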

 

 

 

The full collection comes to 1.7 GB! That is rather scary...

Here is the overall result:

 

 

 

From a performance point of view, the bottleneck is in the back-end.
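
The back-end share can be cross-checked against the raw counters exported later in this post. The sketch below divides TOPDOWN.BACKEND_BOUND_SLOTS and UOPS_RETIRED.SLOTS by TOPDOWN.SLOTS. This is the basic top-down ratio, and it is an assumption that VTune's displayed percentages are derived the same way; the helper name `fraction_of_slots` is made up for illustration.

```python
# Event counts copied from the hardware event export in this post.
TOPDOWN_SLOTS = 13_658_454_097_535
TOPDOWN_BACKEND_BOUND_SLOTS = 9_234_752_770_425
UOPS_RETIRED_SLOTS = 3_905_945_858_910


def fraction_of_slots(count: int, total_slots: int) -> float:
    """Return count as a fraction of all pipeline slots (hypothetical helper)."""
    if total_slots <= 0:
        raise ValueError("total_slots must be positive")
    return count / total_slots


backend_bound = fraction_of_slots(TOPDOWN_BACKEND_BOUND_SLOTS, TOPDOWN_SLOTS)
retiring = fraction_of_slots(UOPS_RETIRED_SLOTS, TOPDOWN_SLOTS)

print(f"Back-End Bound: {backend_bound:.1%}")
print(f"Retiring:       {retiring:.1%}")
```

With these counts, roughly two thirds of the slots are back-end bound and a bit under 30% retire useful work, which agrees with the conclusion above.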

 

Now drill into Retiring to see what the CPU instructions are mostly doing:

 

 

 

Floating-point (FP) arithmetic accounts for a fairly large share: 13%.

 

On the front-end side, cache misses, branch mispredictions and the like account for very little:

 

 

 

On the back-end side:

 

 

 

Long-latency operations like divides and memory operations can cause this, as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).

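Judging from the description, the L2 cache seems to be what is holding things back: L1 is at 100% while L2 is far too low. At least, that is how I read it.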
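
One rough way to check this reading is to split the memory stall cycles by cache level using the CYCLE_ACTIVITY.STALLS_* counters exported later in this post. The sketch below uses simple differences between adjacent levels, divided by CPU_CLK_UNHALTED.THREAD. This is a simplified attribution, not VTune's exact formula, and the function name `stall_breakdown` is made up for illustration.

```python
# Counts copied from the hardware event export in this post.
CLKS = 3_122_854_800_000            # CPU_CLK_UNHALTED.THREAD
STALLS_MEM_ANY = 1_551_274_653_810  # CYCLE_ACTIVITY.STALLS_MEM_ANY
STALLS_L1D_MISS = 1_527_559_582_665 # CYCLE_ACTIVITY.STALLS_L1D_MISS
STALLS_L2_MISS = 226_650_679_950    # CYCLE_ACTIVITY.STALLS_L2_MISS
STALLS_L3_MISS = 162_225_486_675    # CYCLE_ACTIVITY.STALLS_L3_MISS


def stall_breakdown(clks, mem_any, l1d_miss, l2_miss, l3_miss):
    """Attribute stall cycles to the level that served the load (simplified).

    A stall that missed L1D but not L2 is charged to L2, and so on.
    """
    return {
        "L1": (mem_any - l1d_miss) / clks,
        "L2": (l1d_miss - l2_miss) / clks,
        "L3": (l2_miss - l3_miss) / clks,
        "DRAM": l3_miss / clks,
    }


breakdown = stall_breakdown(
    CLKS, STALLS_MEM_ANY, STALLS_L1D_MISS, STALLS_L2_MISS, STALLS_L3_MISS
)
for level, share in breakdown.items():
    print(f"{level:>4}: {share:.1%} of cycles")
```

Under this simplified split, stalls on loads that miss L1D but are served by L2 make up by far the largest share (a little over 40% of cycles), which is consistent with L2 being the weak spot.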

 

 

 

Looking at the call stack, the time is dominated by a single module.

 

 

Let's look at the event counts:

 

 

 

Export the hardware event types:

Hardware Events
    Hardware Event Type	Hardware Event Count
    ARITH.DIVIDER_ACTIVE	571,366,714,095   ==> Cycles when the divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations. Division and square root operations: this fits the profile of mining!
    BACLEARS.ANY	24,000,720   ==> The BACLEARS event counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end. The BACLEARS.ANY event counts the number of baclears for any type of branch. So this one relates to branch misprediction.
    BR_INST_RETIRED.ALL_BRANCHES	179,656,042,170   ==> Counts all retired branch instructions. Branch prediction predicts the branch target and lets the processor start executing instructions long before the branch's true execution path is known. All branches use the Branch Prediction Unit (BPU) for prediction. The unit predicts the target address based not only on the EIP of the branch but also on the execution path through which execution reached that EIP. The BPU can efficiently predict the following branch types: conditional branches, direct calls and jumps, indirect calls and jumps, and returns.
    BR_MISP_RETIRED.ALL_BRANCHES	695,542,005
    CPU_CLK_UNHALTED.DISTRIBUTED	2,762,526,000,000   ==> This event distributes cycle counts between active hyperthreads, i.e. those in C0. A hyperthread becomes inactive when it executes the HLT or MWAIT instructions. If all other hyperthreads are inactive (or disabled or do not exist), all counts are attributed to this hyperthread. To obtain the full count while the core is active, sum the counts from each hyperthread.
    CPU_CLK_UNHALTED.REF_TSC	2,522,358,800,000
    CPU_CLK_UNHALTED.THREAD	3,122,854,800,000
    CPU_CLK_UNHALTED.THREAD_P	3,103,054,654,575
    CYCLE_ACTIVITY.CYCLES_L1D_MISS	2,207,076,621,210 ==》Cycles while L1 cache miss demand load is outstanding.
    CYCLE_ACTIVITY.CYCLES_MEM_ANY	2,970,053,910,135
    CYCLE_ACTIVITY.STALLS_L1D_MISS	1,527,559,582,665
    CYCLE_ACTIVITY.STALLS_L2_MISS	226,650,679,950
    CYCLE_ACTIVITY.STALLS_L3_MISS	162,225,486,675
    CYCLE_ACTIVITY.STALLS_MEM_ANY	1,551,274,653,810
    CYCLE_ACTIVITY.STALLS_TOTAL	1,592,284,776,840
    DSB2MITE_SWITCHES.PENALTY_CYCLES	1,669,550,085
    DTLB_LOAD_MISSES.STLB_HIT:cmask=1	5,694,170,820
    DTLB_LOAD_MISSES.WALK_ACTIVE	84,254,527,560
    DTLB_STORE_MISSES.STLB_HIT:cmask=1	292,508,775
    DTLB_STORE_MISSES.WALK_ACTIVE	370,511,115
    EXE_ACTIVITY.1_PORTS_UTIL	273,300,409,950
    EXE_ACTIVITY.2_PORTS_UTIL	390,990,586,485
    EXE_ACTIVITY.BOUND_ON_STORES	195,000,585
    FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE	563,478,403,845
    FRONTEND_RETIRED.ANY_DSB_MISS	24,163,691,340
    FRONTEND_RETIRED.DSB_MISS	660,046,200
    FRONTEND_RETIRED.L2_MISS	24,001,680
    FRONTEND_RETIRED.LATENCY_GE_16	45,003,150
    FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1	25,053,253,605
    FRONTEND_RETIRED.LATENCY_GE_4	232,516,275
    ICACHE_16B.IFDATA_STALL	2,205,039,690
    ICACHE_64B.IFTAG_STALL	1,176,017,640
    IDQ.DSB_CYCLES_ANY	710,761,066,140
    IDQ.DSB_CYCLES_OK	619,500,929,250
    IDQ.DSB_UOPS	3,580,955,371,425
    IDQ.MITE_CYCLES_ANY	92,280,138,420
    IDQ.MITE_CYCLES_OK	67,200,100,800
    IDQ.MITE_UOPS	335,040,502,560
    IDQ.MS_SWITCHES	657,019,710
    IDQ.MS_UOPS	4,468,634,055
    IDQ_UOPS_NOT_DELIVERED.CORE	351,316,053,945
    IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE	38,835,116,505
    ILD_STALL.LCP	7,500,135
    INST_RETIRED.ANY	3,769,987,000,000
    INST_RETIRED.NOP	90,000,135
    INT_MISC.CLEAR_RESTEER_CYCLES	7,215,129,870
    INT_MISC.RECOVERY_CYCLES:cmask=1:e=yes	975,017,550
    INT_MISC.UOP_DROPPING	16,350,049,050
    L1D_PEND_MISS.FB_FULL	3,135,009,405
    L1D_PEND_MISS.FB_FULL_PERIODS	180,000,540
    L1D_PEND_MISS.L2_STALL	2,910,008,730
    L1D_PEND_MISS.PENDING	2,753,288,259,840
    L2_RQSTS.ALL_RFO	37,389,560,835
    L2_RQSTS.RFO_HIT	24,540,368,100
    LD_BLOCKS.STORE_FORWARD	3,000,090
    LD_BLOCKS_PARTIAL.ADDRESS_ALIAS	7,704,231,120
    MACHINE_CLEARS.COUNT	85,502,565
    MEM_INST_RETIRED.ALL_STORES	200,160,600,480
    MEM_INST_RETIRED.ANY	732,047,196,135
    MEM_INST_RETIRED.LOCK_LOADS	15,001,050
    MEM_INST_RETIRED.SPLIT_LOADS	9,000,270
    MEM_INST_RETIRED.SPLIT_STORES	12,000,360
    MEM_INST_RETIRED.STLB_MISS_LOADS	1,413,042,390
    MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT	600,330
    MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM	2,401,320
    MEM_LOAD_RETIRED.FB_HIT	136,277,038,725
    MEM_LOAD_RETIRED.L1_HIT	336,031,008,090
    MEM_LOAD_RETIRED.L1_MISS	60,759,911,385
    MEM_LOAD_RETIRED.L2_HIT	54,858,822,870
    MEM_LOAD_RETIRED.L3_HIT	4,997,549,265
    MEM_LOAD_RETIRED.L3_MISS	456,191,520
    OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD:cmask=4	9,735,029,205
    OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD	2,673,818,021,430
    OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO	1,002,168,006,495
    RESOURCE_STALLS.SCOREBOARD	5,067,152,010
    TOPDOWN.BACKEND_BOUND_SLOTS	9,234,752,770,425
    TOPDOWN.SLOTS	13,658,454,097,535
    UOPS_DECODED.DEC0	33,000,099,000
    UOPS_DECODED.DEC0:cmask=1	17,385,052,155
    UOPS_DISPATCHED.PORT_0	910,771,366,155
    UOPS_DISPATCHED.PORT_1	994,651,491,975
    UOPS_DISPATCHED.PORT_2_3	534,780,802,170
    UOPS_DISPATCHED.PORT_4_9	223,530,335,295
    UOPS_DISPATCHED.PORT_5	850,201,275,300
    UOPS_DISPATCHED.PORT_6	899,491,349,235
    UOPS_DISPATCHED.PORT_7_8	207,810,311,715
    UOPS_EXECUTED.CYCLES_GE_3	855,031,282,545
    UOPS_EXECUTED.THREAD	4,300,326,450,480
    UOPS_ISSUED.ANY	4,063,476,095,205
    UOPS_RETIRED.SLOTS	3,905,945,858,910
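
A few ratios derived from the counts above make the raw numbers easier to read. The sketch below computes instructions per cycle, the fraction of cycles in which the divider is busy, and the branch misprediction rate. The choice of CPU_CLK_UNHALTED.THREAD as the cycle denominator is an assumption, and the helper `ratio` is made up for illustration.

```python
# Counts copied from the hardware event table above.
events = {
    "INST_RETIRED.ANY": 3_769_987_000_000,
    "CPU_CLK_UNHALTED.THREAD": 3_122_854_800_000,
    "ARITH.DIVIDER_ACTIVE": 571_366_714_095,
    "BR_INST_RETIRED.ALL_BRANCHES": 179_656_042_170,
    "BR_MISP_RETIRED.ALL_BRANCHES": 695_542_005,
}


def ratio(numerator: str, denominator: str) -> float:
    """Divide one event count by another (hypothetical helper)."""
    return events[numerator] / events[denominator]


ipc = ratio("INST_RETIRED.ANY", "CPU_CLK_UNHALTED.THREAD")
divider_busy = ratio("ARITH.DIVIDER_ACTIVE", "CPU_CLK_UNHALTED.THREAD")
mispredict_rate = ratio("BR_MISP_RETIRED.ALL_BRANCHES",
                        "BR_INST_RETIRED.ALL_BRANCHES")

print(f"IPC:                  {ipc:.2f}")
print(f"Divider busy cycles:  {divider_busy:.1%}")
print(f"Branch mispredicts:   {mispredict_rate:.2%}")
```

With these counts the divider is busy for roughly 18% of all cycles, while well under 1% of retired branches are mispredicted. The second figure matches the earlier observation that front-end and misprediction costs are small.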

 

Wow, that is far too many. Let me write a program to sort them before analyzing.

TOPDOWN.SLOTS 13658454097535   ==> skip; presumably used by the top-down analysis itself
TOPDOWN.BACKEND_BOUND_SLOTS 9234752770425   ==> same as above
UOPS_EXECUTED.THREAD 4300326450480   ==> Number of uops to be executed per-thread each cycle. Probably not useful for miner detection.
UOPS_ISSUED.ANY 4063476095205   ==> Uops that Resource Allocation Table (RAT) issues to Reservation Station (RS). Probably not useful for miner detection.
UOPS_RETIRED.SLOTS 3905945858910
INST_RETIRED.ANY 3769987000000
IDQ.DSB_UOPS 3580955371425
CPU_CLK_UNHALTED.THREAD 3122854800000
CPU_CLK_UNHALTED.THREAD_P 3103054654575
CYCLE_ACTIVITY.CYCLES_MEM_ANY 2970053910135
CPU_CLK_UNHALTED.DISTRIBUTED 2762526000000
L1D_PEND_MISS.PENDING 2753288259840
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD 2673818021430
CPU_CLK_UNHALTED.REF_TSC 2522358800000
CYCLE_ACTIVITY.CYCLES_L1D_MISS 2207076621210
CYCLE_ACTIVITY.STALLS_TOTAL 1592284776840
CYCLE_ACTIVITY.STALLS_MEM_ANY 1551274653810
CYCLE_ACTIVITY.STALLS_L1D_MISS 1527559582665
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO 1002168006495
UOPS_DISPATCHED.PORT_1 994651491975
UOPS_DISPATCHED.PORT_0 910771366155
UOPS_DISPATCHED.PORT_6 899491349235
UOPS_EXECUTED.CYCLES_GE_3 855031282545
UOPS_DISPATCHED.PORT_5 850201275300
MEM_INST_RETIRED.ANY 732047196135
IDQ.DSB_CYCLES_ANY 710761066140
IDQ.DSB_CYCLES_OK 619500929250
ARITH.DIVIDER_ACTIVE 571366714095
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE 563478403845
UOPS_DISPATCHED.PORT_2_3 534780802170
EXE_ACTIVITY.2_PORTS_UTIL 390990586485
IDQ_UOPS_NOT_DELIVERED.CORE 351316053945
MEM_LOAD_RETIRED.L1_HIT 336031008090
IDQ.MITE_UOPS 335040502560
EXE_ACTIVITY.1_PORTS_UTIL 273300409950
CYCLE_ACTIVITY.STALLS_L2_MISS 226650679950
UOPS_DISPATCHED.PORT_4_9 223530335295
UOPS_DISPATCHED.PORT_7_8 207810311715
MEM_INST_RETIRED.ALL_STORES 200160600480
BR_INST_RETIRED.ALL_BRANCHES 179656042170
CYCLE_ACTIVITY.STALLS_L3_MISS 162225486675
MEM_LOAD_RETIRED.FB_HIT 136277038725
IDQ.MITE_CYCLES_ANY 92280138420
DTLB_LOAD_MISSES.WALK_ACTIVE 84254527560
IDQ.MITE_CYCLES_OK 67200100800
MEM_LOAD_RETIRED.L1_MISS 60759911385
MEM_LOAD_RETIRED.L2_HIT 54858822870
IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE 38835116505
L2_RQSTS.ALL_RFO 37389560835
UOPS_DECODED.DEC0 33000099000
FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1 25053253605
L2_RQSTS.RFO_HIT 24540368100
FRONTEND_RETIRED.ANY_DSB_MISS 24163691340
UOPS_DECODED.DEC0:cmask=1 17385052155
INT_MISC.UOP_DROPPING 16350049050
OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD:cmask=4 9735029205
LD_BLOCKS_PARTIAL.ADDRESS_ALIAS 7704231120
INT_MISC.CLEAR_RESTEER_CYCLES 7215129870
DTLB_LOAD_MISSES.STLB_HIT:cmask=1 5694170820
RESOURCE_STALLS.SCOREBOARD 5067152010
MEM_LOAD_RETIRED.L3_HIT 4997549265
IDQ.MS_UOPS 4468634055
L1D_PEND_MISS.FB_FULL 3135009405
L1D_PEND_MISS.L2_STALL 2910008730
ICACHE_16B.IFDATA_STALL 2205039690
DSB2MITE_SWITCHES.PENALTY_CYCLES 1669550085
MEM_INST_RETIRED.STLB_MISS_LOADS 1413042390
ICACHE_64B.IFTAG_STALL 1176017640
INT_MISC.RECOVERY_CYCLES:cmask=1:e=yes 975017550
BR_MISP_RETIRED.ALL_BRANCHES 695542005
FRONTEND_RETIRED.DSB_MISS 660046200
IDQ.MS_SWITCHES 657019710
MEM_LOAD_RETIRED.L3_MISS 456191520
DTLB_STORE_MISSES.WALK_ACTIVE 370511115
DTLB_STORE_MISSES.STLB_HIT:cmask=1 292508775
FRONTEND_RETIRED.LATENCY_GE_4 232516275
EXE_ACTIVITY.BOUND_ON_STORES 195000585
L1D_PEND_MISS.FB_FULL_PERIODS 180000540
INST_RETIRED.NOP 90000135
MACHINE_CLEARS.COUNT 85502565
FRONTEND_RETIRED.LATENCY_GE_16 45003150
FRONTEND_RETIRED.L2_MISS 24001680
BACLEARS.ANY 24000720
MEM_INST_RETIRED.LOCK_LOADS 15001050
MEM_INST_RETIRED.SPLIT_STORES 12000360
MEM_INST_RETIRED.SPLIT_LOADS 9000270
ILD_STALL.LCP 7500135
LD_BLOCKS.STORE_FORWARD 3000090
MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM 2401320
MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT 600330
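
The sorting program itself is not shown in this post. A minimal sketch that produces a listing in the same shape could look like the following. It assumes each exported row is an event name, whitespace, then a comma-grouped count, optionally followed by an annotation; `parse_events`, `sort_events`, and the `SAMPLE` text are made-up names for illustration.

```python
import re

# An event row: name without spaces, whitespace, then a comma-grouped count.
# Header lines and free-text annotation lines do not match and are skipped.
ROW = re.compile(r"^\s*(?P<name>\S+)\s+(?P<count>\d[\d,]*)")


def parse_events(text: str) -> dict[str, int]:
    """Extract {event name: count} from an exported hardware event listing."""
    events = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            events[match.group("name")] = int(match.group("count").replace(",", ""))
    return events


def sort_events(events: dict[str, int]) -> list[tuple[str, int]]:
    """Return (name, count) pairs ordered from largest to smallest count."""
    return sorted(events.items(), key=lambda item: item[1], reverse=True)


# A few rows taken from the export above, including one with an annotation.
SAMPLE = (
    "Hardware Events\n"
    "    Hardware Event Type\tHardware Event Count\n"
    "    ARITH.DIVIDER_ACTIVE\t571,366,714,095   ==> divide unit busy\n"
    "    BACLEARS.ANY\t24,000,720\n"
    "    DTLB_LOAD_MISSES.STLB_HIT:cmask=1\t5,694,170,820\n"
    "    TOPDOWN.SLOTS\t13,658,454,097,535\n"
)

ranked = sort_events(parse_events(SAMPLE))
for name, count in ranked:
    print(name, count)
```

To process the full export, read the saved text file and pass its contents to `parse_events` instead of `SAMPLE`.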

 

I'll continue the analysis tomorrow; I can barely keep my eyes open...

 
