Handling Procedure
gaussdb=# select * from pv_total_memory_detail;
nodename | memorytype | memorymbytes
coordinator1 | max_process_memory | 81920
coordinator1 | process_used_memory | 14567
coordinator1 | max_dynamic_memory | 34012
coordinator1 | dynamic_used_memory | 1851
coordinator1 | dynamic_peak_memory | 3639
coordinator1 | dynamic_used_shrctx | 394
coordinator1 | dynamic_peak_shrctx | 399
coordinator1 | max_backend_memory | 648
coordinator1 | backend_used_memory | 1
coordinator1 | max_shared_memory | 46747
coordinator1 | shared_used_memory | 11618
coordinator1 | max_cstore_memory | 512
coordinator1 | cstore_used_memory | 0
coordinator1 | max_sctpcomm_memory | 0
coordinator1 | sctpcomm_used_memory | 0
coordinator1 | sctpcomm_peak_memory | 0
coordinator1 | other_used_memory | 1013
coordinator1 | gpu_max_dynamic_memory | 0
coordinator1 | gpu_dynamic_used_memory | 0
coordinator1 | gpu_dynamic_peak_memory | 0
coordinator1 | pooler_conn_memory | 0
coordinator1 | pooler_freeconn_memory | 0
coordinator1 | storage_compress_memory | 0
coordinator1 | udf_reserved_memory | 0
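(24 rows)
View the size of the global memory contexts of the database process, grouped by memory context name and sorted from largest to smallest. The top 10 are sufficient.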
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pg_shared_memory_detail group by contextname order by sum desc limit 10;
contextname | sum | ?column? | count
-----------------------------------+----------------------+-----------------------+-------
IncreCheckPointContext | 250.8796234130859375 | .00273132324218750000 | 1
AshContext | 64.0950317382812500 | .00772094726562500000 | 1
DefaultTopMemoryContext | 60.5699005126953125 | 1.0594177246093750 | 1
StorageTopMemoryContext | 16.7601776123046875 | .05357360839843750000 | 1
GlobalAuditMemory | 16.0081176757812500 | .00769042968750000000 | 1
CBBTopMemoryContext | 14.9503479003906250 | .04009246826171875000 | 1
Undo | 8.6680450439453125 | .21752929687500000000 | 1
DoubleWriteContext | 6.5549163818359375 | .02331542968750000000 | 1
ThreadPoolContext | 5.4042663574218750 | .00525665283203125000 | 1
GlobalSysDBCacheEntryMemCxt_16384 | 4.2232666015625000 | .89799499511718750000 | 16
(10 rows)
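View the size of the memory contexts of all threads in the database process, grouped by memory context name and sorted from largest to smallest. The top 10 are sufficient.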
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pv_thread_memory_context group by contextname order by sum desc limit 10;
contextname | sum | ?column? | count
---------------------------------+----------------------+-----------------------+-------
LocalSysCacheShareMemoryContext | 612.5096435546875000 | 57.4630737304687500 | 543
StorageTopMemoryContext | 311.8157348632812500 | 3.2519149780273438 | 543
DefaultTopMemoryContext | 168.5756530761718750 | 10.7153015136718750 | 543
LocalSysCacheMyDBMemoryContext | 167.4375000000000000 | 65.7499847412109375 | 543
ThreadTopMemoryContext | 161.4440002441406250 | 4.0309295654296875 | 543
CBBTopMemoryContext | 109.1161880493164063 | 6.7845993041992188 | 543
LocalSysCacheTopMemoryContext | 93.4109802246093750 | 13.2236938476562500 | 543
Timezones | 43.2421417236328125 | 1.4333953857421875 | 543
gs_signal | 32.2394561767578125 | 4.9155120849609375 | 1
Type information cache | 22.9119262695312500 | .86848449707031250000 | 329
(10 rows)
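View the size of the memory contexts of all sessions in the database process, grouped by memory context name and sorted from largest to smallest. The top 10 are sufficient.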
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pv_session_memory_context group by contextname order by sum desc limit 10;
contextname | sum | ?column? | count
-------------------------+----------------------+-----------------------+-------
CachedPlan | 223.4433593750000000 | 64.6083068847656250 | 12394
CachedPlanQuery | 134.7382812500000000 | 42.3366699218750000 | 12596
SessionTopMemoryContext | 132.3496398925781250 | 25.9272155761718750 | 302
CachedPlanSource | 98.6943359375000000 | 28.3841018676757813 | 12897
CBBTopMemoryContext | 60.6870880126953125 | 3.0470962524414063 | 302
GenericRoot | 35.1962890625000000 | 14.1624069213867188 | 471
Timezones | 24.0499572753906250 | .79721069335937500000 | 302
SPI Plan | 21.0664062500000000 | 6.8149719238281250 | 2396
AdaptiveCachedPlan | 17.5449218750000000 | 4.7733078002929688 | 546
Prepared Queries | 16.4062500000000000 | 7.5508117675781250 | 300
(10 rows)
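The `count` column is worth reading alongside `sum`. The following is a minimal sketch, not part of the original procedure, that derives the number of contexts per session and the average size of one context from the first rows of the output above. It assumes that each session owns exactly one `SessionTopMemoryContext`, so that its count (302) equals the number of sessions. The variable names are illustrative only.

```python
# Rows copied from the pv_session_memory_context output above:
# (contextname, sum in MB, count)
rows = [
    ("CachedPlan", 223.443359375, 12394),
    ("CachedPlanQuery", 134.73828125, 12596),
    ("SessionTopMemoryContext", 132.349639892578125, 302),
    ("CachedPlanSource", 98.6943359375, 12897),
]

# Assumption: one SessionTopMemoryContext per session.
sessions = next(c for name, _, c in rows if name == "SessionTopMemoryContext")

# Contexts of each kind held by an average session.
per_session = {name: count / sessions for name, _, count in rows}

# Average size of a single context, in KB.
avg_kb = {name: mb * 1024 / count for name, mb, count in rows}

for name, _, _ in rows:
    print(f"{name}: {per_session[name]:.1f} per session, {avg_kb[name]:.1f} KB each")
```

A high per-session count with a small average size suggests that many small objects, such as cached plans, are accumulating rather than a single large allocation.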
gaussdb=# select * from gs_get_history_memory_detail(NULL) order by memory_info desc limit 10;
memory_info
-------------------------------
mem_log-2023-03-10_205125.log
mem_log-2023-03-10_205115.log
mem_log-2023-03-10_205104.log
mem_log-2023-03-10_205054.log
mem_log-2023-03-10_205043.log
mem_log-2023-03-10_205032.log
mem_log-2023-03-10_205022.log
mem_log-2023-03-10_205012.log
mem_log-2023-03-10_205002.log
mem_log-2023-03-10_204951.log
(10 rows)
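Select one of the log files and run the following query to read its contents. The log records the global memory overview and the top 20 memory contexts by usage at the global, thread, and session levels, as shown below.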
gaussdb=# select * from gs_get_history_memory_detail('mem_log-2023-03-10_205125.log');
memory_info
--------------------------------------------------------------------------------------
{
"Global Memory Statistics": {
"Max_dynamic_memory": 34012,
"Dynamic_used_memory": 3645,
"Dynamic_peak_memory": 3664,
"Dynamic_used_shrctx": 401,
"Dynamic_peak_shrctx": 401,
"Max_backend_memory": 648,
"Backend_used_memory": 1,
"other_used_memory": 0
},
"Memory Context Info": {
"Memory Context Detail": {
"Context Type": "Shared Memory Context",
"Memory Context": {
"context": "IncreCheckPointContext",
"freeSize": 2864,
"totalSize": 263066352
},
...
},
"Memory Context Detail": {
"Context Type": "Session Memory Context",
"Memory Context": {
"context": "CachedPlan",
"freeSize": 68041368,
"totalSize": 235937792
},
...
},
"Memory Context Detail": {
"Context Type": "Thread Memory Context",
"Memory Context": {
"context": "LocalSysCacheShareMemoryContext",
"freeSize": 60431360,
"totalSize": 644141760
},
...
}
}
(322 rows)
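The log reports `totalSize` and `freeSize` in bytes, whereas the view queries earlier divide by 1024 twice to show MB. A small sketch of that conversion, using only the three entries visible in the excerpt above (the elided `...` entries are not reconstructed), makes the two outputs directly comparable. The `entries` list and `to_mb` helper are illustrative names, not part of GaussDB.

```python
BYTES_PER_MB = 1024 * 1024

# Entries copied from the log excerpt above; sizes are in bytes.
entries = [
    {"type": "Shared Memory Context", "context": "IncreCheckPointContext",
     "freeSize": 2864, "totalSize": 263066352},
    {"type": "Session Memory Context", "context": "CachedPlan",
     "freeSize": 68041368, "totalSize": 235937792},
    {"type": "Thread Memory Context", "context": "LocalSysCacheShareMemoryContext",
     "freeSize": 60431360, "totalSize": 644141760},
]


def to_mb(size_in_bytes):
    """Same arithmetic as sum(totalsize)/1024/1024 in the SQL queries."""
    return size_in_bytes / BYTES_PER_MB


# context name -> (total MB, free MB)
report = {e["context"]: (to_mb(e["totalSize"]), to_mb(e["freeSize"])) for e in entries}

for name, (total_mb, free_mb) in report.items():
    print(f"{name}: total {total_mb:.2f} MB, free {free_mb:.2f} MB")
```

For `IncreCheckPointContext`, 263066352 bytes corresponds to the 250.8796 MB shown by the `pg_shared_memory_detail` query. The session-level and thread-level figures differ slightly from the live queries because the log was written at a different moment.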
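Based on the memory usage overview obtained from the memory statistics queries, analyze as follows:
If dynamic_used_memory is large and dynamic_used_shrctx is small, threads and sessions are using most of the memory.
If dynamic_used_memory is large and dynamic_used_shrctx is close to dynamic_used_memory, the global memory contexts are using most of the dynamic memory.
If only shared_used_memory is large, shared memory accounts for the usage and this can be ignored.
If other_used_memory is large, the usual cause is that frequent memory allocation and release during service execution has left too much fragmented memory in caches.
For each of these cases, use the corresponding one of the four locating methods below.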
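The rules above can be sketched as a small decision function. The source does not define what "large" or "close" means, so the thresholds below (`high`, `shrctx_close`, `other_high`) are assumptions chosen purely for illustration, as are the function name and the returned labels. Tune them to your own environment.

```python
def classify_memory(stats, high=0.8, shrctx_close=0.5, other_high=0.3):
    """Map pv_total_memory_detail values (MB) to the cases described above.

    All thresholds are illustrative assumptions, not GaussDB defaults.
    """
    findings = []
    dyn_used = stats["dynamic_used_memory"]

    # Dynamic memory is "large" when it approaches max_dynamic_memory.
    if dyn_used >= high * stats["max_dynamic_memory"]:
        if stats["dynamic_used_shrctx"] >= shrctx_close * dyn_used:
            findings.append("global")          # global memory contexts
        else:
            findings.append("thread_session")  # thread and session contexts

    # other_used_memory is "large" relative to what the process uses overall.
    if stats["other_used_memory"] >= other_high * stats["process_used_memory"]:
        findings.append("other")

    # Only shared memory is large: nothing to fix.
    if not findings and stats["shared_used_memory"] >= high * stats["max_shared_memory"]:
        findings.append("shared_only")

    return findings


# Values taken from the first pv_total_memory_detail output in this document.
snapshot = {
    "process_used_memory": 14567, "max_dynamic_memory": 34012,
    "dynamic_used_memory": 1851, "dynamic_used_shrctx": 394,
    "max_shared_memory": 46747, "shared_used_memory": 11618,
    "other_used_memory": 1013,
}
print(classify_memory(snapshot))
```

With these assumed thresholds, the first snapshot in this document triggers none of the cases, which is consistent with a healthy instance.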
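a. Global memory contexts use a large amount of memory
When the live environment is available
Run the following query to identify which memory context is using the most memory.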
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pg_shared_memory_detail group by contextname order by sum desc limit 10;
contextname | sum | ?column? | count
-----------------------------------+----------------------+-----------------------+-------
IncreCheckPointContext | 250.8796234130859375 | .00273132324218750000 | 1
AshContext | 64.0950317382812500 | .00772094726562500000 | 1
DefaultTopMemoryContext | 60.5699005126953125 | 1.0594177246093750 | 1
StorageTopMemoryContext | 16.7601776123046875 | .04942321777343750000 | 1
GlobalAuditMemory | 16.0081176757812500 | .00769042968750000000 | 1
CBBTopMemoryContext | 14.9503479003906250 | .04009246826171875000 | 1
Undo | 8.6680450439453125 | .20516967773437500000 | 1
DoubleWriteContext | 6.5549163818359375 | .02331542968750000000 | 1
ThreadPoolContext | 5.3873443603515625 | .00525665283203125000 | 1
GlobalSysDBCacheEntryMemCxt_16384 | 4.3115692138671875 | 1.0470581054687500 | 16
(10 rows)
After identifying the memory context, take IncreCheckPointContext as an example: query gs_get_shared_memctx_detail to locate the code position where memory is accumulating.
gaussdb=# select * from gs_get_shared_memctx_detail('IncreCheckPointContext');
file | line | size
-------------------------+------+-----------
ipci.cpp | 476 | 64
pagewriter.cpp | 298 | 1024
ipci.cpp | 498 | 4096
pagewriter.cpp | 322 | 19632000
pagewriter.cpp | 317 | 33669120
storage_buffer_init.cpp | 90 | 209756160
(6 rows)
The result above shows that line 90 of storage_buffer_init.cpp allocated a large amount of memory. Memory may be accumulating there without being released.
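To make "a large amount" concrete, the following sketch picks the largest allocation site from the gs_get_shared_memctx_detail output above and computes its share of the listed total. The rows are copied from that output; the variable names are illustrative.

```python
# (file, line, size in bytes) from gs_get_shared_memctx_detail('IncreCheckPointContext')
sites = [
    ("ipci.cpp", 476, 64),
    ("pagewriter.cpp", 298, 1024),
    ("ipci.cpp", 498, 4096),
    ("pagewriter.cpp", 322, 19632000),
    ("pagewriter.cpp", 317, 33669120),
    ("storage_buffer_init.cpp", 90, 209756160),
]

total = sum(size for _, _, size in sites)

# The allocation site holding the most memory.
top = max(sites, key=lambda site: site[2])
share = top[2] / total

print(f"{top[0]}:{top[1]} holds {top[2]} of {total} bytes ({share:.1%})")
```

A single site holding most of a context's memory is a candidate to investigate, although a large one-time allocation made at startup is not necessarily a leak.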
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pv_thread_memory_context group by contextname order by sum desc limit 10;
contextname | sum | ?column? | count
---------------------------------+----------------------+-----------------------+-------
LocalSysCacheShareMemoryContext | 641.0926513671875000 | 60.0820159912109375 | 543
StorageTopMemoryContext | 311.8157348632812500 | 3.1896591186523438 | 543
LocalSysCacheMyDBMemoryContext | 175.0625000000000000 | 65.0446166992187500 | 543
DefaultTopMemoryContext | 168.5756530761718750 | 10.7153015136718750 | 543
ThreadTopMemoryContext | 161.9752502441406250 | 4.1196441650390625 | 543
CBBTopMemoryContext | 109.1161880493164063 | 6.7845993041992188 | 543
LocalSysCacheTopMemoryContext | 93.4109802246093750 | 13.2236938476562500 | 543
Timezones | 43.2421417236328125 | 1.4333953857421875 | 543
gs_signal | 32.2394561767578125 | 4.9155120849609375 | 1
Type information cache | 23.8869018554687500 | .90544128417968750000 | 343
(10 rows)
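After identifying the memory context, take StorageTopMemoryContext as an example: query gs_get_thread_memctx_detail (the first argument is the thread ID, which can be obtained from the view gs_thread_memory_context) to locate the code position where memory is accumulating.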
gaussdb=# select * from gs_get_thread_memctx_detail(140639273547520,'StorageTopMemoryContext');
file | line | size
--------------+------+--------
syncrep.cpp | 1608 | 32
elog.cpp | 2008 | 16
fd.cpp | 2734 | 128
syncrep.cpp | 1568 | 32
deadlock.cpp | 175 | 512
deadlock.cpp | 169 | 342656
deadlock.cpp | 157 | 85664
deadlock.cpp | 146 | 21416
deadlock.cpp | 144 | 32112
deadlock.cpp | 136 | 10712
deadlock.cpp | 135 | 10712
deadlock.cpp | 128 | 85664
deadlock.cpp | 126 | 21416
(13 rows)
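When one source file appears on many lines, as deadlock.cpp does here, summing by file can show the picture more clearly than reading individual rows. The sketch below groups the output above by file. It is an illustrative aid rather than part of the documented procedure, and the variable names are assumptions.

```python
from collections import Counter

# (file, line, size in bytes) from the gs_get_thread_memctx_detail output above
sites = [
    ("syncrep.cpp", 1608, 32), ("elog.cpp", 2008, 16), ("fd.cpp", 2734, 128),
    ("syncrep.cpp", 1568, 32), ("deadlock.cpp", 175, 512),
    ("deadlock.cpp", 169, 342656), ("deadlock.cpp", 157, 85664),
    ("deadlock.cpp", 146, 21416), ("deadlock.cpp", 144, 32112),
    ("deadlock.cpp", 136, 10712), ("deadlock.cpp", 135, 10712),
    ("deadlock.cpp", 128, 85664), ("deadlock.cpp", 126, 21416),
]

# Total bytes allocated per source file.
by_file = Counter()
for file, _line, size in sites:
    by_file[file] += size

# Files ordered from largest to smallest total.
ranked = by_file.most_common()
for file, size in ranked:
    print(f"{file}: {size} bytes")
```

Note that this output covers a single thread. The 311 MB reported for StorageTopMemoryContext earlier is the sum over all 543 threads.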
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pv_session_memory_context group by contextname order by sum desc limit 10;
contextname | sum | ?column? | count
----------------------------+----------------------+-----------------------+-------
CachedPlan | 226.1093750000000000 | 67.1747817993164063 | 12450
CachedPlanQuery | 134.8027343750000000 | 41.8541030883789063 | 12612
SessionTopMemoryContext | 132.1605682373046875 | 26.1002349853515625 | 301
CachedPlanSource | 98.7617187500000000 | 28.4135513305664063 | 12912
CBBTopMemoryContext | 60.4861373901367188 | 3.0370101928710938 | 301
Timezones | 23.9703216552734375 | .79457092285156250000 | 301
SPI Plan | 21.1307907104492188 | 6.8435440063476563 | 2412
GenericRoot | 19.9628906250000000 | 7.7032165527343750 | 374
Prepared Queries | 16.4062500000000000 | 7.5508117675781250 | 300
unnamed prepared statement | 14.3437500000000000 | 6.6462554931640625 | 300
(10 rows)
After identifying the memory context, take CachedPlanQuery as an example: query gs_get_session_memctx_detail to locate the code position where memory is accumulating.
gaussdb=# select * from gs_get_session_memctx_detail('CachedPlanQuery');
file | line | size
---------------+------+---------
copyfuncs.cpp | 2607 | 5031680
copyfuncs.cpp | 7013 | 4176736
copyfuncs.cpp | 7016 | 2088368
copyfuncs.cpp | 5062 | 6918144
copyfuncs.cpp | 3461 | 403552
copyfuncs.cpp | 3397 | 2727104
copyfuncs.cpp | 3401 | 487368
datum.cpp | 150 | 2048
copyfuncs.cpp | 2572 | 1113728
copyfuncs.cpp | 6204 | 32
copyfuncs.cpp | 6206 | 32
copyfuncs.cpp | 7021 | 4267200
copyfuncs.cpp | 7037 | 2832000
copyfuncs.cpp | 7048 | 2066400
bitmapset.cpp | 94 | 134400
copyfuncs.cpp | 3430 | 96000
copyfuncs.cpp | 2847 | 2150400
copyfuncs.cpp | 2551 | 5126400
copyfuncs.cpp | 3984 | 105600
list.cpp | 105 | 254400
list.cpp | 108 | 796800
copyfuncs.cpp | 3835 | 7065600
copyfuncs.cpp | 2451 | 1056000
copyfuncs.cpp | 2453 | 244800
copyfuncs.cpp | 3840 | 230400
copyfuncs.cpp | 2895 | 1113600
copyfuncs.cpp | 3442 | 38400
copyfuncs.cpp | 2645 | 115200
list.cpp | 166 | 19200
namespace.cpp | 3853 | 144000
list.cpp | 1460 | 288000
copyfuncs.cpp | 2910 | 38400
copyfuncs.cpp | 2762 | 1075200
copyfuncs.cpp | 3953 | 67200
copyfuncs.cpp | 3000 | 96000
copyfuncs.cpp | 5876 | 28800
copyfuncs.cpp | 2619 | 2400
(37 rows)
The result above shows that line 3835 of copyfuncs.cpp allocated a large amount of memory. Memory may be accumulating there without being released.
gaussdb=# select * from pv_total_memory_detail;
nodename | memorytype | memorymbytes
coordinator1 | max_process_memory | 81920
coordinator1 | process_used_memory | 24567
coordinator1 | max_dynamic_memory | 34012
coordinator1 | dynamic_used_memory | 1851
coordinator1 | dynamic_peak_memory | 3639
coordinator1 | dynamic_used_shrctx | 394
coordinator1 | dynamic_peak_shrctx | 399
coordinator1 | max_backend_memory | 648
coordinator1 | backend_used_memory | 1
coordinator1 | max_shared_memory | 46747
coordinator1 | shared_used_memory | 11618
coordinator1 | max_cstore_memory | 512
coordinator1 | cstore_used_memory | 0
coordinator1 | max_sctpcomm_memory | 0
coordinator1 | sctpcomm_used_memory | 0
coordinator1 | sctpcomm_peak_memory | 0
coordinator1 | other_used_memory | 11013
coordinator1 | gpu_max_dynamic_memory | 0
coordinator1 | gpu_dynamic_used_memory | 0
coordinator1 | gpu_dynamic_peak_memory | 0
coordinator1 | pooler_conn_memory | 0
coordinator1 | pooler_freeconn_memory | 0
coordinator1 | storage_compress_memory | 0
coordinator1 | udf_reserved_memory | 0
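(24 rows)
Memory not released promptly for other reasons
Note that an excessively large other_used_memory is not always caused by memory fragmentation. It can also have the following causes:
1) Some code allocates memory directly through the malloc interface instead of from a memory context, and that memory has accumulated.
2) Third-party open-source software does not release memory promptly.
In either case, contact Huawei engineers for assistance.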
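3. Solutions
Memory is full because of memory accumulation
Solution: If accumulated memory is not released for a long time, perform a primary/standby switchover to reduce memory usage.
Memory is full because of the service workload
Solution: Modify the client jobs by reducing concurrency or rewriting the SQL statements so that they do not consume a large amount of memory during execution. Contact Huawei engineers for a detailed solution.
Memory is full because too much "other" memory is cached
Solution 1: If the service scenario causes the high "other" memory cache, adjust the parameters related to execution plans or adjust the workload on the client side. The change depends on the specific service scenario. Contact Huawei engineers for a detailed solution.
Solution 2: If accumulated memory is not released for a long time and the workload cannot be adjusted to reduce it, perform a primary/standby switchover to reduce memory usage.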