Android ANR深入分析
一:说明
本文针对相对比较复杂的anr问题。 复制代码
二:ANR分类说明
上图为笔者从android9的代码中梳理出的所有anr。
所有anr都与四大组件关联在一起,整体上可分为两类
与用户直接交互引起的anr,触发源头为input。
后台组件操作超时引起的anr。触发源头又分为两类:
第一类是服务和广播,都是直接在ams的线程中触发,
第二类是contentprovider,触发源头为发起contentprovider操作的进程主线程。
而所有anr的原因,可以分为以下几类:
受anr监听的事件本身超时,如touch事件处理超时。这类anr比较容易定位。
受anr监听的事件无法进入主线程开始处理,如主线程被block或任务忙,等待时间超时引起anr。
窗口状态不对,如绘制超时,焦点错误等引起。
ps:主线程阻塞了不一定会引起ANR,有些情况下,主线程可能阻塞了一些时间,但在这些时间内,没有事件进入,因此就不触发anr。
不同ANR事件的监听实现
input anr,通过InputDispatcher线程监听(直接用looper的epoll timeout,对于当前inputChannel,是阻塞式的,对其他channel则是非阻塞式的)
service anr,通过AMS的主线程(线程名:ActivityManager)监听(通过Handler的delay,非阻塞式)
broadcast anr,通过AMS的主线程(线程名:ActivityManager)监听(通过Handler的delay,非阻塞式)
ContentProvider anr,通过当前进程的主线程去监听,触发后调用AMS去触发anr(通过Handler的delay,非阻塞式)
下图为典型的input anr触发流程
InputDispatcherAPPMainThreadActivityManagersend event to app正常情况下,app finsh event 的时间很短app finish event handlesend event to appanr时,app主线程在超时时间内不向InputDispatcher返回状态send event to app在下一个event事件时,触发一个5s的计时,triger anr5s后如果还未收到finish,就触发anrInputDispatcherAPPMainThreadActivityManager
下图为典型的 service/broadcast anr触发流程
ActivityManagerAPPMainThread正常情况时:在跨进程之前,发一个超时的handler消息service/broadcast callams调用相关的操作app service/broadcast finish handle在跨进程之后,移出超时的handler消息ANR时:service/broadcast call应用一直无响应,超时消息到时后发出ams就执行对应的anr事件。ActivityManagerAPPMainThread
下图为典型的 contentProvider anr触发流程
APP1MainThreadAPP2MainThreadActivityManagerapp1在跨进程调用contentProvider前,发一个超时消息call contentProvider methodreturn在超时前收到 contentProvider 的调用,移出超时的handler消息在超时前未收到 contentProvider 的调用,则向ams触发anrtriger anrAPP1MainThreadAPP2MainThreadActivityManager
三:一些比较典型的复杂anr
1: 主线程空闲
"main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 flags=1 obj=0x752a4b08 self=0x7204214c00 | sysTid=4434 nice=-10 cgrp=default sched=0/0 handle=0x7289e7b548 | state=S schedstat=( 857618499 263812800 679 ) utm=70 stm=15 core=1 HZ=100 | stack=0x7fccb8f000-0x7fccb91000 stackSize=8MB | held mutexes= kernel: (couldn't read /proc/self/task/4434/stack) native: #00 pc 000000000006e118 /system/lib64/libc.so (offset 6e000) (__epoll_pwait+8) native: #01 pc 0000000000013f8c /system/lib64/libutils.so (android::Looper::pollInner(int)+144) native: #02 pc 0000000000013e60 /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+60) native: #03 pc 000000000011f9c0 /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44) at android.os.MessageQueue.nativePollOnce(Native method) at android.os.MessageQueue.next(MessageQueue.java:326) at android.os.Looper.loop(Looper.java:160) at android.app.ActivityThread.main(ActivityThread.java:6733) at java.lang.reflect.Method.invoke(Native method) at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:506) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858) 复制代码
如上trace,出现anr的进程的主线程显空闲,很多同学在遇到类anr的时候会比较无助,大多数情况下,应用开发的同学会直接转给做系统开发的同学继续跟进。 遇到这类问题时,先确认anr触发时间和trace dump 时间差多久,以为附录示例为参考
10-14 15:43:19.393在inputdispatcher中触发anr,
10-14 15:43:19.421 在ams中开始处理anr
10-14 15:43:19.934 开始dump anr进程的调用stack
10-14 15:43:20.239 完成 anr进程的调用栈dump.
在这个示例中,从inputdispatcher发现事件超时,到开始dump anr进程的trace,花了500+ms,到dump完成,总共花了850+ms,这次不算长,也不算短,但实际上在这次中,主线程dump出来的stack是处于空闲态的。当然,这次anr并不是因为主线已经过了block时间点,但实际开发debug中,有很多情况下是dump的时候,anr的进程已经过了阻塞点。此次只能基于logcag的信息,找到anr触发的具体时间,再去看主线程的log.不然就容易在错误的时间,基于错误的log的分析。
2:跨线程/进程阻塞
被其他线程block的是很容易定位的,trace中通过lock锁就可以轻易找到,这里就不展开了。
跨进程阻塞:(这里只包含跨进程的部分,更早的调用栈包含公司包名信息,就不贴出来了)
"main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 flags=1 obj=0x73ec6b08 self=0x7e38414c00 | sysTid=2411 nice=-10 cgrp=default sched=0/0 handle=0x7ebdf9b548 | state=S schedstat=( 1110108606 4113781665 1463 ) utm=72 stm=38 core=0 HZ=100 | stack=0x7ffbe55000-0x7ffbe57000 stackSize=8MB | held mutexes= kernel: (couldn't read /proc/self/task/2411/stack) native: #00 pc 000000000006e218 /system/lib64/libc.so (__ioctl+4) native: #01 pc 0000000000029124 /system/lib64/libc.so (ioctl+136) native: #02 pc 000000000005a5e4 /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+256) native: #03 pc 000000000005b374 /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+60) native: #04 pc 000000000005b1c0 /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+176) native: #05 pc 0000000000051318 /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72) native: #06 pc 000000000006a47c /system/lib64/libmedia.so (android::BpMediaPlayerService::getCodecList() const+248) native: #07 pc 00000000001633b4 /system/lib64/libstagefright.so (android::MediaCodecList::getInstance() [clone .cfi]+332) native: #08 pc 0000000000034128 /system/lib64/libmedia_jni.so (android_media_MediaCodecList_getCodecCount(_JNIEnv*, _jobject*)+36) at android.media.MediaCodecList.native_getCodecCount(Native method) at android.media.MediaCodecList.initCodecList(MediaCodecList.java:82) 复制代码
这调用栈中,可以看到主线程处在binder调用中,具体是哪一个调用,一般看BpBinder::transact前一个调用的方法是什么,这里是getCodecList,那就在整个anr文件中搜哪个进程在被调用这个方法。
"Binder:658_1" sysTid=951 #00 pc 00053fac /system/lib/libc.so (__ioctl+8) #01 pc 00021c59 /system/lib/libc.so (ioctl+36) #02 pc 00015def /system/lib/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+190) #03 pc 0001696d /system/lib/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+268) #04 pc 00015bb7 /system/lib/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+90) #05 pc 00028cfb /system/lib/android.hardware.media.omx@1.0.so (android::hardware::media::omx::V1_0::BpHwOmx::_hidl_allocateNode(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::hardware::hidl_string const&, android::sp<android::hardware::media::omx::V1_0::IOmxObserver> const&, std::__1::function<void (android::hardware::media::omx::V1_0::Status, android::sp<android::hardware::media::omx::V1_0::IOmxNode> const&)>)+406) #06 pc 000296ef /system/lib/android.hardware.media.omx@1.0.so (android::hardware::media::omx::V1_0::BpHwOmx::allocateNode(android::hardware::hidl_string const&, android::sp<android::hardware::media::omx::V1_0::IOmxObserver> const&, std::__1::function<void (android::hardware::media::omx::V1_0::Status, android::sp<android::hardware::media::omx::V1_0::IOmxNode> const&)>)+78) #07 pc 0001dceb /system/lib/libmedia_omx.so (android::hardware::media::omx::V1_0::utils::LWOmx::allocateNode(char const*, android::sp<android::IOMXObserver> const&, android::sp<android::IOMXNode>*)+98) #08 pc 0005d51d /system/lib/libstagefright.so (android::ACodec::queryCapabilities(char const*, char const*, char const*, bool, android::MediaCodecInfo::CapabilitiesWriter*)+200) #09 pc 000a0115 /system/lib/libstagefright.so (android::(anonymous namespace)::queryCapabilities(android::hardware::media::omx::V1_0::IOmxStore::NodeInfo const&, char const*, bool, android::MediaCodecInfo::CapabilitiesWriter*)+76) #10 pc 0009fed9 /system/lib/libstagefright.so (android::OmxInfoBuilder::buildMediaCodecList(android::MediaCodecListWriter*)+996) #11 pc 0008ffc3 /system/lib/libstagefright.so (android::MediaCodecList::MediaCodecList(std::__1::vector<android::MediaCodecListBuilderBase*, std::__1::allocator<android::MediaCodecListBuilderBase*>>)+234) #12 pc 0009015d /system/lib/libstagefright.so (android::MediaCodecList::getLocalInstance()+68) #13 pc 00039279 /system/lib/libmediaplayerservice.so (android::MediaPlayerService::getCodecList() const+36) #14 pc 00033ce9 /system/lib/libmedia.so (android::BnMediaPlayerService::onTransact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+640) #15 pc 000361cf /system/lib/libbinder.so (android::BBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+70) #16 pc 0003da17 /system/lib/libbinder.so (android::IPCThreadState::executeCommand(int)+410) #17 pc 0003d7a3 /system/lib/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+106) #18 pc 0003dccb /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+38) #19 pc 0005470d /system/lib/libbinder.so (android::PoolThread::threadLoop()+12) #20 pc 0000c1a3 /system/lib/libutils.so (android::Thread::_threadLoop(void*)+166) #21 pc 00063cb5 /system/lib/libc.so (__pthread_start(void*)+22) #22 pc 0001e0ad /system/lib/libc.so (__start_thread+22) 复制代码
搜索发现,在/system/bin/mediaserver进程中,有多个 getCodecList 被调用,这是一个互斥的调用,在此进程中,只有上面显示的这个调用栈在向下调用。 在这个调用中,可以看这个调用最后也是一个binder调用,也处在被其他进程block的状态。 那下一步就是继续找BpHwBinder::transact 的前一个调用的是什么方法,从trace中看是_hidl_allocateNode 接下来继续去追 _hidl_allocateNode的调用。直到找到真正的block点。
附录:一个完整的input anr 流程分析
下面通过一个完整的input anr触发流程,展示一个anr问题的分析过程(简单的anr并不需要完整的分析流程)这个流程分成5步,很多人在anr的trace文件中没有看到直接的问题点时,就直接从第5步所在位置开始分析,并从这个时间点向前推算,但这可能并不准确。就如示例中的一样,ActivityManager : ANR in所在时间点事实上与anr触发时有5s的差距。
正确的步骤应该是从Event log am_anr开始,如果没有event log,则向前看,直接找InputDispatcher,如果是input事件超时,会有一个wait queue age,用这个时间向前推,则可以找到精确的input时间。
找到精确的input时间后,再通过anr进程的主线程log,去看这个时间点上,主线程做了什么,如果前后都没有主线程的log,则有可能是阻塞了,再找anr前最近的主线程log去看,大概率就是那行log之后的操作引起的。
触发anr的事件
10-14 15:43:14.386 3693 3706 D (input事件,具体log包含公司信息,就不贴出来了)
InputDispatcher 检测到事件超时,触发anr事件
默认超时时间为5s,超时方式为epoll_wait超时
10-14 15:43:19.393 1946 2036 I InputDispatcher : Application is not responding: AppWindowToken{d0046c6 token=Token{7a154a1 ActivityRecord{e26d508 u0 com.XXX.XXX/.activity.MainActivity t49}}}. It has been 5006.9ms since event, 5006.2ms since wait started. Reason: Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up. 10-14 15:43:19.394 1946 2036 I WindowManager : Input event dispatching timed out sending to application AppWindowToken{d0046c6 token=Token{7a154a1 ActivityRecord{e26d508 u0 com.XXX.XXX/.activity.MainActivity t49}}}. Reason: Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up. 复制代码
AMS收到anr事件,开始处理anr事件
10-14 15:43:19.421 1946 1961 I am_anr : [0,4827,com.XXX.XXX,550026829,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)] 复制代码
开始依次dump当前进程的调用栈
10-14 15:43:19.934 1946 1961 I system_server : libdebuggerd_client: started dumping process 4827 10-14 15:43:20.239 1946 1961 I system_server : libdebuggerd_client: done dumping process 4827 10-14 15:43:20.239 1946 1961 I system_server : libdebuggerd_client: started dumping process 1946 10-14 15:43:20.240 1946 1951 I system_server : Thread[2,tid=1951,WaitingInMainSignalCatcherLoop,Thread*=0x6f3f229000,peer=0x12d00000,"Signal Catcher"]: reacting to signal 3 10-14 15:43:20.240 1946 1951 I system_server : 10-14 15:43:20.421 1946 1951 I system_server : Wrote stack traces to '[tombstoned]' 10-14 15:43:20.421 1946 1961 I system_server : libdebuggerd_client: done dumping process 1946 ... 10-14 15:43:26.570 1946 1961 I system_server : libdebuggerd_client: started dumping process 5840 10-14 15:43:26.716 1946 1961 I system_server : libdebuggerd_client: done dumping process 5840 复制代码
完成调用栈,把cpu的负载状态打印出来
10-14 15:43:26.757 1946 1961 E ActivityManager : ANR in com.XXX.XXX (com.XXX.XXX/.activity.MainActivity) 10-14 15:43:26.757 1946 1961 E ActivityManager : PID: 4827 10-14 15:43:26.757 1946 1961 E ActivityManager : Reason: Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.) 10-14 15:43:26.757 1946 1961 E ActivityManager : Load: 10.53 / 4.93 / 1.85 10-14 15:43:26.757 1946 1961 E ActivityManager : CPU usage from 0ms to 7320ms later (2022-10-14 15:43:19.396 to 2022-10-14 15:43:26.716): 10-14 15:43:26.757 1946 1961 E ActivityManager : 28% 2227/com.XXX.XXX: 16% user + 12% kernel / faults: 7991 minor 8 major 10-14 15:43:26.757 1946 1961 E ActivityManager : 19% 530/surfaceflinger: 9.5% user + 9.6% kernel / faults: 993 minor 10-14 15:43:26.757 1946 1961 E ActivityManager : 14% 6036/adbd: 1.7% user + 12% kernel / faults: 154514 minor 10-14 15:43:26.757 1946 1961 E ActivityManager : 11% 533/vendor.XXX.hardware.XXX@1.0-service: 9.6% user + 2% kernel / faults: 2508 minor 10-14 15:43:26.757 1946 1961 E ActivityManager : 10% 1946/system_server: 3.9% user + 6.6% kernel / faults: 4631 minor 复制代码
至此,和anr相关的log信信息已经取完。
实际分析问题时,可以结合上面的5个时间节点去找对应的信息。
作者:lizhenjun114
链接:https://juejin.cn/post/7170975699593855012