Android Performance Optimization: App Jank Monitoring Approaches and How They Work

Posted on 2019-12-12, filed under Android

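After a long run of iterations, an app's business logic bloats and certain code paths start to block and stutter. We need effective ways to monitor for and detect these problems so that we can step in and fix them early.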

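Overview

The main jank-monitoring approaches currently used in the industry are:

A worker thread that continuously polls the main thread
Measuring the time between log lines with a Looper Printer
Choreographer FrameCallback
Instrumentation that records function entry and exit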

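Worker thread continuously polling the main thread

We can start a worker thread that keeps polling the main thread. The principle and the implementation are both simple: keep sending Messages to the main thread and, at a fixed interval, check whether the Message just sent has been handled. If it hasn't, the main thread was stuck during that interval.

Pros: simple to implement, and it can catch every kind of jank. Cons: polling is inelegant, and the polling interval is hard to choose. The shorter the interval, the bigger the performance impact; the longer the interval, the more likely it is that jank goes unreported.

Why: suppose the polling interval is 3s and the main thread stalls from 1.5s to 4.5s. I won't detect it, because both the 0s~3s window and the 3s~6s window contain time when the main thread is not stalled, so the Message sent in each window is still handled before the next check. With a jank threshold of 3s, this stall goes unreported. There's no particularly good fix; all you can do is tune the interval to balance overhead against the miss rate.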
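The missed-report arithmetic above can be checked with a small model. This is only a sketch in plain Java, not Android code: the class name PollingMissDemo and the method detects are hypothetical, and the model assumes that a probe sent while the main thread is blocked is handled the moment the block ends.

```java
final class PollingMissDemo {
    /**
     * Returns true if a watchdog that sends a probe every intervalMs, and checks it
     * one interval later, notices a block lasting from blockStartMs to blockEndMs.
     */
    static boolean detects(long intervalMs, long blockStartMs, long blockEndMs) {
        // Walk every probe that is sent before the block ends.
        for (long sentAt = 0; sentAt < blockEndMs; sentAt += intervalMs) {
            // A probe sent during the block is only handled once the block ends.
            boolean sentDuringBlock = sentAt >= blockStartMs && sentAt < blockEndMs;
            long handledAt = sentDuringBlock ? blockEndMs : sentAt;
            long checkedAt = sentAt + intervalMs;
            if (handledAt > checkedAt) {
                return true; // the probe was still pending when the watchdog checked
            }
        }
        return false;
    }
}
```

In this model, detects(3000, 1500, 4500) is false, which matches the example: the probe sent at 3s is handled at 4.5s, well before the check at 6s. Halving the interval, detects(1500, 1500, 4500), is true, at the cost of polling twice as often.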

Sample code:

class UiMonitorThread implements Runnable {
    @Override public void run() {
        while (isRunning) {
            // Send a message to the main thread every 1.5s
            uiMonitorHandler.sendEmptyMessage(id);
            try {
                Thread.sleep(1500);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop polling
                Thread.currentThread().interrupt();
                return;
            }
            // If two consecutive messages were not handled, treat it as jank
            checkMessageHandled();
        }
    }
}
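The snippet above leaves out the check itself. A minimal sketch of one probe-and-check round follows, assuming a single-thread executor as a stand-in for the main thread and its Handler so that it runs on a plain JVM. WatchdogSketch and probeOnce are hypothetical names, not part of any Android API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;

final class WatchdogSketch {
    /**
     * Posts one probe to the stand-in main thread, waits intervalMs, and
     * returns true if the probe had NOT been handled by then.
     */
    static boolean probeOnce(ExecutorService mainThread, long intervalMs) {
        AtomicBoolean handled = new AtomicBoolean(false);
        // On Android this would be a Message or Runnable posted to a main-thread Handler.
        mainThread.execute(() -> handled.set(true));
        try {
            Thread.sleep(intervalMs);
        } catch (InterruptedException e) {
            // Keep the interrupt flag and fall through to report the current state.
            Thread.currentThread().interrupt();
        }
        return !handled.get();
    }
}
```

A real monitor would call this in a loop from the worker thread and capture the main thread's stack trace when it returns true.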

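Measuring the time between log lines with a Looper Printer

We can call the system method setMessageLogging to replace the Printer object on the main thread's Looper. By measuring the time between the Printer's log lines, we get the execution time of the system's dispatchMessage call.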

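Looper.getMainLooper().setMessageLogging(str -> {
    // Compute the interval between two adjacent log lines
});

Pros: simple to implement, and a slow dispatchMessage is never missed. Cons: some kinds of jank cannot be observed this way.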
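The time-difference calculation can be sketched as a small class that receives every line the Looper prints. This is plain Java so that it can run off-device: DispatchTimer, its threshold, and the injected clock are assumptions for illustration. It relies only on the ">>>>> Dispatching" and "<<<<< Finished" prefixes that appear in the Looper source.

```java
import java.util.function.LongSupplier;

final class DispatchTimer {
    private final LongSupplier clockMs;   // injected so the logic can be tested without a device
    private final long thresholdMs;
    private long startMs = -1;
    private long lastSlowCostMs = -1;

    DispatchTimer(LongSupplier clockMs, long thresholdMs) {
        this.clockMs = clockMs;
        this.thresholdMs = thresholdMs;
    }

    /** Feed this every line the Looper hands to its Printer. */
    void println(String line) {
        if (line.startsWith(">>>>> Dispatching")) {
            startMs = clockMs.getAsLong();
        } else if (line.startsWith("<<<<< Finished") && startMs >= 0) {
            long costMs = clockMs.getAsLong() - startMs;
            startMs = -1;
            if (costMs > thresholdMs) {
                lastSlowCostMs = costMs; // a real monitor would report a stack trace here
            }
        }
    }

    /** Cost of the most recent slow dispatch, or -1 if none has been seen. */
    long lastSlowCostMs() {
        return lastSlowCostMs;
    }
}
```

On a device, one way to wire this up would be to construct it with SystemClock::uptimeMillis and pass timer::println to Looper.getMainLooper().setMessageLogging.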

Excerpt from the system Looper.loop source (loopOnce)

private static boolean loopOnce(final Looper me,
            final long ident, final int thresholdOverride) {
    Message msg = me.mQueue.next(); // might block
    if (msg == null) {
        // No message indicates that the message queue is quitting.
        return false;
    }

    // This must be in a local variable, in case a UI event sets the logger
    final Printer logging = me.mLogging;
    if (logging != null) {
        logging.println(">>>>> Dispatching to " + msg.target + " "
                        + msg.callback + ": " + msg.what);
    }
    //......
    msg.target.dispatchMessage(msg);
    //......
    if (logging != null) {
        logging.println("<<<<< Finished to " + msg.target + " " + msg.callback);
    }

    // Make sure that during the course of dispatching the
    // identity of the thread wasn't corrupted.
    final long newIdent = Binder.clearCallingIdentity();
    if (ident != newIdent) {
        Log.wtf(TAG, "Thread identity changed from 0x"
                + Long.toHexString(ident) + " to 0x"
                + Long.toHexString(newIdent) + " while dispatching to "
                + msg.target.getClass().getName() + " "
                + msg.callback + " what=" + msg.what);
    }

    msg.recycleUnchecked();

    return true;
}

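As the code shows, monitoring dispatchMessage alone cannot cover every kind of jank. The comment on mQueue.next says it plainly: might block. Two calls inside next() can stall: nativePollOnce and idler.queueIdle().

Excerpt from MessageQueue.next()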

@UnsupportedAppUsage
Message next() {
    // Return here if the message loop has already quit and been disposed.
    // This can happen if the application tries to restart a looper after quit
    // which is not supported.
    final long ptr = mPtr;
    if (ptr == 0) {
        return null;
    }
    int pendingIdleHandlerCount = -1; // -1 only during first iteration
    int nextPollTimeoutMillis = 0;
    for (;;) {
        if (nextPollTimeoutMillis != 0) {
            Binder.flushPendingCommands();
        }
        nativePollOnce(ptr, nextPollTimeoutMillis);
        //......
    }
    //......
}

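nativePollOnce matters a lot. Besides being where the main thread blocks while idle, it is also where View touch events are handled. So if the app contains many custom Views or handles many onTouch events, this blind spot is hard to accept.

On top of that, Native Messages are also processed inside nativePollOnce, so a stall there cannot be observed either.

queueIdle() is called when the main thread is idle, so a time-consuming operation there can also cause jank, and that jank is likewise invisible to the Printer.

Another jank scenario is the often-mentioned sync barrier (the name left me completely baffled the first time I heard it). Messages are synchronous by default. When we call invalidate to refresh the UI, the call eventually reaches scheduleTraversals in ViewRootImpl, which calls postSyncBarrier to insert a sync barrier message into the main thread's Looper queue. The goal is that, while the UI is being refreshed, the Looper skips the synchronous messages so that the asynchronous messages that render the UI are handled first.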

void scheduleTraversals() {
    if (!mTraversalScheduled) {
        mTraversalScheduled = true;
        mTraversalBarrier = mHandler.getLooper().getQueue().postSyncBarrier();
        mChoreographer.postCallback(
                Choreographer.CALLBACK_TRAVERSAL, mTraversalRunnable, null);
        notifyRendererOfFramePending();
        pokeDrawLockIfNeeded();
    }
}
void unscheduleTraversals() {
    if (mTraversalScheduled) {
        mTraversalScheduled = false;
        mHandler.getLooper().getQueue().removeSyncBarrier(mTraversalBarrier);
        mChoreographer.removeCallbacks(
                Choreographer.CALLBACK_TRAVERSAL, mTraversalRunnable, null);
    }
}

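Why can a sync barrier cause jank? As the code shows, scheduleTraversals and unscheduleTraversals come as a pair, but neither is thread-safe. If invalidate is called from a background thread and scheduleTraversals ends up running several times, unscheduleTraversals can only remove the last mTraversalBarrier. The earlier barriers stay in the queue, the main thread Looper's synchronous messages never get handled, and the app freezes.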
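The token overwrite can be made concrete with a toy model. This is a plain Java sketch, not the real MessageQueue: BarrierLeakModel and its methods are hypothetical stand-ins, and racedSchedule simulates two threads that both pass the mTraversalScheduled check before either sets it.

```java
import java.util.HashSet;
import java.util.Set;

final class BarrierLeakModel {
    private final Set<Integer> barriers = new HashSet<>(); // barriers currently in the queue
    private int nextToken = 1;
    private int traversalBarrier; // like mTraversalBarrier: remembers only one token

    int postSyncBarrier() {
        int token = nextToken++;
        barriers.add(token);
        return token;
    }

    void removeSyncBarrier(int token) {
        barriers.remove(token);
    }

    /** Two racing scheduleTraversals calls: each posts a barrier, the second overwrites the token. */
    void racedSchedule() {
        traversalBarrier = postSyncBarrier(); // thread A
        traversalBarrier = postSyncBarrier(); // thread B, A's token is now lost
    }

    /** Removes only the barrier whose token is still remembered. */
    void unschedule() {
        removeSyncBarrier(traversalBarrier);
    }

    int leakedBarriers() {
        return barriers.size();
    }
}
```

After racedSchedule() followed by unschedule(), one barrier remains in the model with no token left to remove it, which is the leak described above.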

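Despite all these problems, this is a mainstream monitoring approach, and some of its shortcomings already have solutions.

Monitoring onTouchEvent inside nativePollOnce

We can use an ELF hook to intercept the recvfrom and sendto functions in libinput.so and replace them with our own functions that do the monitoring. A call to recvfrom means the app has received an onTouch event; a call to sendto means the onTouch event has been consumed.

Monitoring IdleHandler#queueIdle

From the source we can see that ArrayList<IdleHandler> mIdleHandlers holds every IdleHandler we care about. So we can use reflection to replace it with our own MyArrayList and override MyArrayList's add method. That lets us observe every IdleHandler as it is added.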

Message next() {
    // Return here if the message loop has already quit and been disposed.
    // This can happen if the application tries to restart a looper after quit
    // which is not supported.
    final long ptr = mPtr;
    if (ptr == 0) {
        return null;
    }

    int pendingIdleHandlerCount = -1; // -1 only during first iteration
    int nextPollTimeoutMillis = 0;
    for (;;) {
        if (nextPollTimeoutMillis != 0) {
            Binder.flushPendingCommands();
        }

        nativePollOnce(ptr, nextPollTimeoutMillis);

        synchronized (this) {
            //......

            // If first time idle, then get the number of idlers to run.
            // Idle handles only run if the queue is empty or if the first message
            // in the queue (possibly a barrier) is due to be handled in the future.
            if (pendingIdleHandlerCount < 0
                && (mMessages == null || now < mMessages.when)) {
                pendingIdleHandlerCount = mIdleHandlers.size();
            }
            if (pendingIdleHandlerCount <= 0) {
                // No idle handlers to run.  Loop and wait some more.
                mBlocked = true;
                continue;
            }

            if (mPendingIdleHandlers == null) {
                mPendingIdleHandlers = new IdleHandler[Math.max(pendingIdleHandlerCount, 4)];
            }
            mPendingIdleHandlers = mIdleHandlers.toArray(mPendingIdleHandlers);
        }

        // Run the idle handlers.
        // We only ever reach this code block during the first iteration.
        for (int i = 0; i < pendingIdleHandlerCount; i++) {
            final IdleHandler idler = mPendingIdleHandlers[i];
            mPendingIdleHandlers[i] = null; // release the reference to the handler

            boolean keep = false;
            try {
                keep = idler.queueIdle();
            } catch (Throwable t) {
                Log.wtf(TAG, "IdleHandler threw exception", t);
            }

            if (!keep) {
                synchronized (this) {
                    mIdleHandlers.remove(idler);
                }
            }
        }

        // Reset the idle handler count to 0 so we do not run them again.
        pendingIdleHandlerCount = 0;

        // While calling an idle handler, a new message could have been delivered
        // so go back and look again for a pending message without waiting.
        nextPollTimeoutMillis = 0;
    }
}

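Once add hands us the IdleHandler being added, we can time its queueIdle call. Code snippet:

static class MyArrayList extends ArrayList<MessageQueue.IdleHandler> {
    @Override
    public boolean add(MessageQueue.IdleHandler handler) {
        // Store only the timing wrapper; adding the original as well would run it twice
        return super.add(new MyIdleHandler(handler));
    }

    @Override
    public boolean remove(Object o) {
        // removeIdleHandler() passes the original handler, so also match what each wrapper holds
        for (int i = 0; i < size(); i++) {
            MessageQueue.IdleHandler h = get(i);
            if (h == o || (h instanceof MyIdleHandler && ((MyIdleHandler) h).idleHandler == o)) {
                remove(i);
                return true;
            }
        }
        return false;
    }
}

static class MyIdleHandler implements MessageQueue.IdleHandler {
    private final MessageQueue.IdleHandler idleHandler;
    MyIdleHandler(MessageQueue.IdleHandler idleHandler) {
        this.idleHandler = idleHandler;
    }
    @Override
    public boolean queueIdle() {
        // Time idleHandler.queueIdle() here and report slow calls
        return this.idleHandler.queueIdle();
    }
}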

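Monitoring sync barrier freezes

We can periodically use reflection to read mMessages from the MessageQueue. If mMessages.target is null and mMessages.when is long past, a sync barrier message may have leaked. At that point we can proactively send one synchronous message and one asynchronous message to the main thread's Looper. If the synchronous message never runs but the asynchronous one is handled, a leak is almost certain.

We can then call removeSyncBarrier(token) via reflection, where token is mMessages.arg1.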
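The two-phase check described above boils down to a small decision function. The sketch below covers only that decision logic in plain Java. The reflection reads and the probe messages are left out, and BarrierLeakCheck, its threshold parameter, and the Verdict names are hypothetical.

```java
final class BarrierLeakCheck {
    enum Verdict { HEALTHY, INCONCLUSIVE, LEAKED }

    /**
     * Phase one: is the head of the queue a barrier that has been due for too long?
     * headTargetIsNull mirrors mMessages.target == null; headAgeMs is now minus mMessages.when.
     */
    static boolean suspect(boolean headTargetIsNull, long headAgeMs, long thresholdMs) {
        return headTargetIsNull && headAgeMs > thresholdMs;
    }

    /** Phase two: interpret the outcome of the sync and async probe messages. */
    static Verdict verdict(boolean suspected, boolean syncProbeHandled, boolean asyncProbeHandled) {
        if (!suspected || syncProbeHandled) {
            return Verdict.HEALTHY; // sync messages still flow, so no barrier is stuck
        }
        if (asyncProbeHandled) {
            return Verdict.LEAKED; // only async messages get past the barrier
        }
        return Verdict.INCONCLUSIVE; // nothing ran: the main thread may simply be busy
    }
}
```

Only a LEAKED verdict would justify the reflective removeSyncBarrier call; an INCONCLUSIVE result points at an ordinary stall instead.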

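Choreographer FrameCallback

Android added Choreographer in 4.1 to work with the VSync mechanism and schedule UI drawing in one place. We can register a FrameCallback with Choreographer; its doFrame(long frameTimeNanos) method is invoked each time a frame is rendered. If the interval between two doFrame calls exceeds 16.6ms (one frame at 60 Hz), jank has occurred. The number of callbacks within 1s is the actual frame rate.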

Choreographer.getInstance().postFrameCallback(new Choreographer.FrameCallback() {
    @Override
    public void doFrame(long frameTimeNanos) {
        // Track the interval between adjacent frames here to detect jank, or count doFrame calls to get the frame rate
        Choreographer.getInstance().postFrameCallback(this);
    }
});
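The interval bookkeeping inside doFrame could look like the following sketch. It is plain Java that takes the frameTimeNanos values as input. FrameStats is a hypothetical name, and the 60 Hz frame period is an assumption that would need adjusting on displays with other refresh rates.

```java
final class FrameStats {
    // Assumes a 60 Hz display; other refresh rates need a different period.
    static final long FRAME_INTERVAL_NANOS = 16_666_667L;

    private boolean hasLastFrame;
    private long lastFrameTimeNanos;
    private long droppedFrames;

    /** Call from doFrame(frameTimeNanos). Returns how many frames were skipped since the previous call. */
    long onFrame(long frameTimeNanos) {
        long skipped = 0;
        if (hasLastFrame) {
            long intervalNanos = frameTimeNanos - lastFrameTimeNanos;
            // Round to whole frame periods so that small VSync jitter is not counted as a drop.
            long periods = Math.round(intervalNanos / (double) FRAME_INTERVAL_NANOS);
            skipped = Math.max(0, periods - 1);
        }
        hasLastFrame = true;
        lastFrameTimeNanos = frameTimeNanos;
        droppedFrames += skipped;
        return skipped;
    }

    long droppedFrames() {
        return droppedFrames;
    }
}
```

A monitor might treat a return value above some small number of skipped frames as jank; where to set that cut-off is a tuning decision.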

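Pros: easy to use, and it supports frame-rate calculation as well as jank monitoring. Cons: capturing stack traces requires a separate worker thread, which consumes some system resources.

Recording function entry and exit through instrumentation

In the Android build pipeline, before class files are compiled to dex, we can use the Transform mechanism provided by the Gradle plugin to post-process the compiled class files. Each Transform's output is the next Transform's input, which lets us rewrite the bytecode. ASM is the recommended tool. The instrumentation details are out of scope here; I may cover them later.

The purpose of instrumentation is to record each function's entry and exit, including the action, the method name, and a timestamp, so that we can compute time costs and reconstruct the call stack.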
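What the injected calls might record can be sketched as follows. This is an illustration only: MethodTracer, enter, and exit are hypothetical names for the hooks that the bytecode rewrite would insert, and the sketch assumes a single thread (a real tracer would keep per-thread state or trace only the main thread).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.LongSupplier;

final class MethodTracer {
    private final LongSupplier clockNanos;              // injected for testability
    private final Deque<String> names = new ArrayDeque<>();
    private final Deque<Long> starts = new ArrayDeque<>();
    private final List<String> report = new ArrayList<>();

    MethodTracer(LongSupplier clockNanos) {
        this.clockNanos = clockNanos;
    }

    /** Inserted at the start of every instrumented method. */
    void enter(String method) {
        names.push(method);
        starts.push(clockNanos.getAsLong());
    }

    /** Inserted before every return or throw of an instrumented method. */
    void exit(String method) {
        String top = names.pop();
        if (!top.equals(method)) {
            throw new IllegalStateException("unbalanced exit: " + method + " vs " + top);
        }
        long costNanos = clockNanos.getAsLong() - starts.pop();
        // After the pop, names.size() is the nesting depth of the call that just finished.
        report.add(names.size() + ":" + method + ":" + costNanos);
    }

    /** Entries of the form depth:method:costNanos, in completion order. */
    List<String> report() {
        return report;
    }
}
```

The depth and cost pairs are enough to rebuild the call tree afterwards, which is what makes this approach traceable without capturing a stack at the moment of jank.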

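Pros: the results are traceable. Every other approach has to capture the stack and assorted context at the moment of jank, which is very hard to get right. Cons: the project's data volume, computation cost, and IO bottlenecks all have to be taken into account. Admittedly this is all fairly abstract; there's always a gap between technical research and a real implementation.

As for coverage, we can instrument selectively to avoid the CPU cost of instrumenting everything:

Exclude third-party and system libraries that don't need it.
Filter out very simple functions.
Filter out compiler-generated code.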
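The three filters above could be combined into one predicate that the bytecode visitor consults before instrumenting a method. Everything specific in this sketch is an assumption: the class name, the excluded package prefixes, and the instruction-count cut-off are placeholders to be tuned per project.

```java
final class InstrumentationFilter {
    // Hypothetical exclusion list, using JVM internal names (slashes instead of dots).
    private static final String[] EXCLUDED_PREFIXES = {
        "android/", "androidx/", "java/", "kotlin/", "com/thirdparty/"
    };
    // Hypothetical cut-off: methods shorter than this count as "very simple".
    private static final int MIN_INSTRUCTIONS = 5;

    static boolean shouldInstrument(String internalClassName, int instructionCount, boolean synthetic) {
        for (String prefix : EXCLUDED_PREFIXES) {
            if (internalClassName.startsWith(prefix)) {
                return false; // system or third-party library
            }
        }
        if (synthetic) {
            return false; // compiler-generated code
        }
        return instructionCount >= MIN_INSTRUCTIONS; // skip trivial getters and setters
    }
}
```

With ASM, the synthetic flag would typically come from testing the method's access flags against Opcodes.ACC_SYNTHETIC.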