October 10th 2018, our team releases a new version of our React Native app. We're happy and proud to deliver new features to our users.
But oh the horror! A few hours after the release, we're suddenly receiving an increased number of Android crashes.
Our crash reporting tool, Sentry is on fire!!
All the new errors are of the type ++code>JSApplicationIllegalArgumentException Error while updating property 'left' in shadow node of type: RCTView"++/code>.
In React Native, this can usually happen if you set a property with a wrong type. But how come we didn't catch this error when testing the app? Our new releases are well tested by every member of the team on a panel of multiple devices.
Also the errors seem quite random, they seem to happen for any combination of ++code>property++/code> and type of shadow node. These are the first 3 that were occurring for instance:
It also seems to happen on any device and any Android version, based on the Sentry report
A majority of Android 8.0.0 crashes but this is consistent with our user base
So the first step when you want to fix a bug is reproducing, right? Fortunately, thanks to our Sentry logs, we know what the users are doing before triggering the crash.
Soooo let's see here...
Welp, for the vast majority of crashes, users are just opening the app and kaboum, the crash happens.
Okay, let's try to replicate then. With the store app installed on 6 different Android devices, let's try to reproduce by opening and quitting the app a few times. No luck crashing the app! All the more impossible to reproduce locally in dev mode.
All right, this seems pointless. The crashes seem quite random anyway. The crash rate is raising at about 10%. Seems like you basically have 1 chance out of 10 to have the app crash on you when starting it.
To be able to reproduce this crash, let's try to understand it, where it comes from, how it thinks...
Well the thing is, as mentioned before, we have a few different errors. And they all have similar yet not entirely similar stacktraces.
Ok then, let's take the first one and analyse:
++pre>++code>java.lang.ArrayIndexOutOfBoundsException: length=10; index=-1
at android.support.v4.util.Pools$SimplePool.release(Pools.java:116)
at com.facebook.react.bridge.DynamicFromMap.recycle(DynamicFromMap.java:40)
at com.facebook.react.uimanager.LayoutShadowNode.setHeight(LayoutShadowNode.java:168)
at java.lang.reflect.Method.invoke(Method.java)
...
java.lang.reflect.InvocationTargetException: null
at java.lang.reflect.Method.invoke(Method.java)
...
com.facebook.react.bridge.JSApplicationIllegalArgumentException: Error while updating property 'height' in shadow node of type: RNSVGSvgView
at com.facebook.react.uimanager.ViewManagersPropertyCache$PropSetter.updateShadowNodeProp(ViewManagersPropertyCache.java:113)
...
++/code>++/pre>
Ok this is where it's happening then, in ++code>android/support/v4/util/Pools.java++/code>.
Hmm, we're pretty deep into the Android support library. At this point, I'm afraid it would take too much time investigating. I want to find the quickest way to fix the issue for my users, then I'll take more time to investigate the root cause behind the bug.
Another way to find the root cause of the bug is to check the new changes we made in this release. Especially changes that would affect native Android code. 2 hypotheses emerge:
At this point since we cannot reproduce the bug, our best bet is:
But how to choose which library to downgrade?
One solution to decide is playing "head or tail", but do we really want to resolve to that?
Ok, let's dig deeper in the stacktrace we had previously picked, to see if we can determine which library to select.
++pre>++code>++code class="language-java">
/**
* Simple (non-synchronized) pool of objects.
*
* @param The pooled type.
*/
public static class SimplePool implements Pool {
private final Object[] mPool;
private int mPoolSize;
...
@Override
public boolean release(T instance) {
if (isInPool(instance)) {
throw new IllegalStateException("Already in the pool!");
}
if (mPoolSize < mPool.length) {
mPool[mPoolSize] = instance;
mPoolSize++;
return true;
}
return false;
}
++/code>++/pre>
This above was where the crash was occurring. The error was ++code>java.lang.ArrayIndexOutOfBoundsException: length=10; index=-1++/code> meaning ++code>mPool++/code> is an array of size 10, but ++code>mPoolSize=-1++/code>.
All right, how can get there with ++code>mPoolSize=-1++/code>? In addition to the ++code>recycle++/code> method above, the only place where ++code>mPoolSize++/code> is modified is in the ++code>acquire++/code> method of the ++code>SimplePool++/code> class:
++pre>++code>++code class="language-java">
public T acquire() {
if (mPoolSize > 0) {
final int lastPooledIndex = mPoolSize - 1;
T instance = (T) mPool[lastPooledIndex];
mPool[lastPooledIndex] = null;
mPoolSize--;
return instance;
}
return null;
}
++/code>++/pre>
So the only way ++code>mPoolSize++/code> can be ++code>-1++/code> is if hitting the line ++code>mPoolSize--++/code> while ++code>mPoolSize=0++/code>. But how is it possible with the condition above ++code>mPoolSize > 0++/code>?
Let's put a breakpoint in Android Studio and check what's going on when we start the app. I mean, there was an ++code>if++/code> condition, this piece of code could not malfunction!
See, ++code>DynamicFromMap++/code> has a static reference to a ++code>SimplePool++/code>
++pre>++code class="language-java">
private static final Pools.SimplePool<DynamicFromMap> sPool = new Pools.SimplePool<>(10);
++/code>++/pre>
After a few dozen hits of the play button, with carefully placed breakpoints, we can see that ++code>SimplePool.acquire++/code> and ++code>SimplePool.release++/code> are called on the mqt_native_modules threads by React Native to manage style properties of React component (below the ++code>width++/code> of a component)
But they are also called on the main thread!
Above we can see that it is used to update a ++code>fill++/code> prop on the main thread, typically for a ++code>react-native-svg++/code> component! Indeed, it turns out ++code>react-native-svg++/code> was using ++code>DynamicFromMap++/code> only since version 7 of the lib to improve the performance of native svg animations.
Aaaand the function can actually be called from 2 threads but ++code>DynamicFromMap++/code> is not using ++code>SimplePool++/code> in a thread safe manner. "Thread safe", you say? For those of you who don't know the term, I'll explain in the next part. If already know what this means, feel free to skip until "In our case, here's what could happen"
So JavaScript is single threaded, so JavaScript developers usually have no need to deal with thread safety.
Java, on the other hand, supports the concept of concurrent, or multi-threaded, programs. Several threads can run within a single program and can potentially access a common data structure, which can lead to unexpected results.
Let's take a simple example, in the image below, both Thread A and Thread B both:
Thread B could potentially access the data value before Thread A was done updating it. We expected two separate increments, to result in a final value of ++code>19++/code>. What we potentially get instead, was ++code>18++/code>. Such a scenario, where the final state of data depends on the relative order of thread operations, is called a race condition. The problem with race conditions is that they don't necessarily happen all the time. It's possible in the case above that Thread B had more work to do before incrementing the value, giving enough time to Thread A to update the value. That explains the randomness and the impossibility to reproduce our crash.
A data structure is said to be thread safe, if operations can be done concurrently by many threads, without the risk of race conditions.
When one thread is performing read on a particular data element, no other thread should be allowed to modify or delete this element (this is called atomicity). In our earlier example, if the update cycles were atomic, the race condition would have been avoided. Thread B would have waited for Thread A to complete its increment, and then it would have gone ahead.
Since ++code>DynamicFromMap++/code> is holding a static reference to the ++code>SimplePool++/code>, multiple ++code>DynamicFromMap++/code> called from different threads can concurrently call the ++code>acquire++/code> method in ++code>SimplePool++/code>.
In the image above, Thread A is calling the method, evaluating the condition to be true, but hasn't yet decreased the value of ++code>mPoolSize++/code> (which is shared with Thread B) while Thread B is concurrently calling the method and already evaluating the condition which is then true also. Then each separate call would subsequently decrease the value of ++code>mPoolSize++/code> and that's how you get an impossible value.
Looking into how to fix it, we found a PR on react-native that was not merged making it thread safe.
We then deployed a patched version of react native rolling it out to our users. The crash was finally fixed, hooray!
Finally, thanks to the help of Janic Duplessis (React Native core contributor), and Mikael Sand (++code>react-native-svg++/code> maintainer), the fix will be included in the next ++code>0.57++/code> minor for React Native
This bug took us some effort to fix but was a great opportunity to dig deeper into ++code>react-native++/code> and ++code>react-native-svg++/code>. A good debugger and a few well placed breakpoints go a long way. Hope you learned something useful as well!
[EDIT 02/04/2019] For French readers: Check out my talk on the subject @ParisJS! ?