By: John Blair, Netflix Partner Engineering
The Netflix application runs on hundreds of smart TVs, streaming sticks and pay TV set top boxes. The role of a Partner Engineer at Netflix is to help device manufacturers launch the Netflix application on their devices. In this article we talk about one particularly difficult issue that blocked the launch of a device in Europe.
Towards the end of 2017, I was on a conference call to discuss an issue with the Netflix application on a new set top box. The box was a new Android TV device with 4k playback, based on Android Open Source Project (AOSP) version 5.0, aka “Lollipop”. I had been at Netflix for a few years, and had shipped multiple devices, but this was my first Android TV device.
All four players involved in the device were on the call: there was the large European pay TV company (the operator) launching the device, the contractor integrating the set-top-box firmware (the integrator), the system-on-a-chip provider (the chip vendor), and myself (Netflix).
The integrator and Netflix had already completed the rigorous Netflix certification process, but during the TV operator’s internal trial an executive at the company reported a serious issue: Netflix playback on his device was “stuttering,” i.e. video would play for a very short time, then pause, then start again, then pause. It didn’t happen all the time, but would reliably start to happen within a few days of powering on the box. They supplied a video and it looked terrible.
The device integrator had found a way to reproduce the problem: repeatedly start Netflix, start playback, then return to the device UI. They supplied a script to automate the process. Sometimes it took as long as five minutes, but the script would always reproduce the bug.
Meanwhile, a field engineer for the chip vendor had diagnosed the root cause: Netflix’s Android TV application, called Ninja, was not delivering audio data quickly enough. The stuttering was caused by buffer starvation in the device audio pipeline. Playback stopped when the decoder waited for Ninja to deliver more of the audio stream, then resumed once more data arrived. The integrator, the chip vendor and the operator all thought the issue was identified, and their message to me was clear: Netflix, you have a bug in your application, and you need to fix it. I could hear the stress in the voices from the operator. Their device was late and running over budget, and they expected results from me.
I was skeptical. The same Ninja application runs on millions of Android TV devices, including smart TVs and other set top boxes. If there was a bug in Ninja, why was it only happening on this device?
I started by reproducing the issue myself using the script provided by the integrator. I contacted my counterpart at the chip vendor and asked if he’d seen anything like this before (he hadn’t). Next I started reading the Ninja source code. I wanted to find the precise code that delivers the audio data. I recognized a lot, but I started to lose the plot in the playback code and I needed help.
I walked upstairs and found the engineer who wrote the audio and video pipeline in Ninja, and he gave me a guided tour of the code. I spent some quality time with the source code myself to understand its working parts, adding my own logging to confirm my understanding. The Netflix application is complex, but at its simplest it streams data from a Netflix server, buffers several seconds’ worth of video and audio data on the device, then delivers video and audio frames one at a time to the device’s playback hardware.
Let’s take a moment to talk about the audio/video pipeline in the Netflix application. Everything up until the “decoder buffer” is the same on every set top box and smart TV, but moving the A/V data into the device’s decoder buffer is a device-specific routine running in its own thread. This routine’s job is to keep the decoder buffer full by calling a Netflix-provided API which provides the next frame of audio or video data. In Ninja, this job is performed by an Android Thread. There is a simple state machine and some logic to handle different play states, but under normal playback the thread copies one frame of data into the Android playback API, then tells the thread scheduler to wait 15 ms and invoke the handler again. When you create an Android thread, you can request that the thread be run repeatedly, as if in a loop, but it is the Android Thread scheduler that calls the handler, not your own application.
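The delivery pattern described above can be sketched as a small simulation. This is not Ninja's code: the names `DecoderBuffer`, `run_delivery_loop` and `scheduler_lag_ms` are hypothetical, and a virtual clock stands in for the Android thread scheduler so the timing is deterministic.

```python
from collections import deque
from dataclasses import dataclass, field

WAKE_DELAY_MS = 15  # delay the handler requests between invocations (from the article)


@dataclass
class DecoderBuffer:
    """Hypothetical stand-in for the device's decoder buffer."""
    frames: list = field(default_factory=list)

    def write(self, frame):
        self.frames.append(frame)


def run_delivery_loop(source, sink, wake_delay_ms=WAKE_DELAY_MS, scheduler_lag_ms=0):
    """Simulate a scheduler that repeatedly invokes a one-frame-per-call handler.

    Returns the virtual time (in ms) of each invocation. `scheduler_lag_ms`
    models any extra delay the scheduler adds on top of the requested wait.
    """
    now_ms = 0
    invocation_times = []
    while source:
        invocation_times.append(now_ms)
        sink.write(source.popleft())  # the handler copies exactly one frame
        # The handler only *requests* a delay; the scheduler decides when it runs next.
        now_ms += wake_delay_ms + scheduler_lag_ms
    return invocation_times


if __name__ == "__main__":
    frames = deque(f"frame-{i}" for i in range(4))
    print(run_delivery_loop(frames, DecoderBuffer()))
```

The point of the sketch is that the handler controls how much it copies per call, but not how often it is called. Any lag added by the scheduler stretches the interval between frames without the handler itself getting any slower.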
To play a 60 fps video, the highest frame rate available in the Netflix catalog, the device must render a new frame about every 16.67 ms (1000 ms / 60), so checking for a new sample every 15 ms is just fast enough to stay ahead of any video stream Netflix can provide. Because the integrator had identified the audio stream as the problem, I zeroed in on the specific thread handler that was delivering audio samples to the Android audio service.
I wanted to answer this question: where is the extra time? I thought some function invoked by the handler would be the culprit, so I sprinkled log messages throughout the handler, assuming the guilty code would be apparent. What was soon apparent was that nothing in the handler was misbehaving, and the handler was running in a few milliseconds even when playback was stuttering.
In the end, I focused on three numbers: the rate of data transfer, the time when the handler was invoked and the time when the handler passed control back to Android. I wrote a script to parse the log output, and made the graph below which gave me the answer.
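The original script is not shown. The following is a minimal sketch of that kind of analysis, deriving the three numbers from handler log lines. The log format (`audio_handler enter=<ms> exit=<ms> bytes=<n>`) and the function names are assumptions made for illustration, not Ninja's actual logging.

```python
import re

# Hypothetical log line format; the real Ninja log output is not shown in the article.
LINE_RE = re.compile(r"audio_handler enter=(\d+) exit=(\d+) bytes=(\d+)")


def parse_log(lines):
    """Extract (enter_ms, exit_ms, bytes_copied) tuples from matching log lines."""
    records = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            records.append(tuple(int(g) for g in match.groups()))
    return records


def summarize(records):
    """For each invocation after the first, compute the three series to plot."""
    rows = []
    for (prev_enter, _, _), (enter, exit_, nbytes) in zip(records, records[1:]):
        interval_ms = enter - prev_enter
        if interval_ms <= 0:
            continue  # skip out-of-order or duplicate timestamps
        rows.append({
            "interval_ms": interval_ms,             # time between handler invocations
            "handler_ms": exit_ - enter,            # time spent inside the handler
            "bytes_per_ms": nbytes / interval_ms,   # rate of data transfer
        })
    return rows


if __name__ == "__main__":
    import sys
    for row in summarize(parse_log(sys.stdin)):
        print(row["interval_ms"], row["handler_ms"], round(row["bytes_per_ms"], 1))
```

Run as a filter over a captured log, this prints one row per invocation that can be fed to any plotting tool.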
The orange line is the rate at which data moved from the streaming buffer into the Android audio system, in bytes/millisecond. You can see three distinct behaviors in this chart:
- The two tall, spiky parts where the data rate reaches 500 bytes/ms. This phase is buffering, before playback starts. The handler is copying data as fast as it can.
- The region in the middle is normal playback. Audio data is moved at about 45 bytes/ms.
- The stuttering region is on the right, where audio data is moving at closer to 10 bytes/ms. This is not fast enough to maintain playback.
The conclusion is unavoidable: the orange line confirms what the chip vendor’s engineer reported. Ninja is not delivering audio data quickly enough.
To understand why, let’s see what story the yellow and grey lines tell.
The yellow line shows the time spent in the handler routine itself, calculated from timestamps recorded at the top and the bottom of the handler. In both the normal and the stutter playback regions, the time spent in the handler was the same: about 2 ms. The spikes show instances when the runtime was slower due to time spent on other tasks on the device.
The grey line, the time between calls invoking the handler, tells a different story. In the normal playback case you can see the handler is invoked about every 15 ms. In the stutter case, on the right, the handler is invoked approximately every 55 ms. There are an extra 40 ms between invocations, and there’s no way that can keep up with playback. But why?
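A rough back-of-the-envelope check ties the grey line to the orange one. It assumes the handler copies about the same amount of audio on every call, which is a simplification of the one-frame-per-call behavior described earlier. The function names are illustrative only.

```python
def bytes_per_call(rate_bytes_per_ms, interval_ms):
    """Bytes the handler must copy per call to sustain a rate at a given interval."""
    return rate_bytes_per_ms * interval_ms


def delivery_rate(call_bytes, interval_ms):
    """Resulting delivery rate if the same per-call amount arrives at another interval."""
    return call_bytes / interval_ms


# Figures quoted in the article: about 45 bytes/ms at a 15 ms interval during normal playback.
per_call = bytes_per_call(45, 15)              # 675 bytes per invocation
stutter_rate = delivery_rate(per_call, 15 + 40)  # same payload, 40 ms later each time
print(per_call, round(stutter_rate, 1))
```

Under that assumption the stretched interval alone yields roughly 12 bytes/ms, which is in line with the "closer to 10 bytes/ms" seen in the stuttering region, with no change needed inside the handler itself.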
I reported my discovery to the integrator and the chip vendor (look, it’s the Android Thread scheduler!), but they continued to push back on the Netflix behavior. Why don’t you just copy more data each time the handler is called? This was a fair criticism, but changing this behavior involved deeper changes than I was prepared to make, and I continued my search for the root cause. I dove into the Android source code, and learned that Android Threads are a userspace construct, and the thread scheduler uses the epoll() system call for timing. I knew epoll() performance isn’t guaranteed, so I suspected something was affecting epoll() in a systematic way.
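One way to see that a timed wait is only a lower bound is to measure how late it returns. The sketch below uses Python's `select.epoll`, which wraps the same Linux system call, with no file descriptors registered so that the call is a pure timed wait. The function name is hypothetical, the fallback to `time.sleep` exists only so the sketch runs on non-Linux platforms, and this is not how the Android scheduler itself is instrumented.

```python
import select
import time


def wait_overshoot_ms(wait_ms=15, repeats=20):
    """Return, for each timed wait, how many milliseconds late it returned."""
    # select.epoll exists only on Linux; fall back to time.sleep elsewhere.
    poller = select.epoll() if hasattr(select, "epoll") else None
    overshoots = []
    try:
        for _ in range(repeats):
            start = time.monotonic()
            if poller is not None:
                poller.poll(wait_ms / 1000.0)  # timeout is given in seconds
            else:
                time.sleep(wait_ms / 1000.0)
            elapsed_ms = (time.monotonic() - start) * 1000.0
            overshoots.append(elapsed_ms - wait_ms)
    finally:
        if poller is not None:
            poller.close()
    return overshoots


if __name__ == "__main__":
    results = wait_overshoot_ms()
    print(f"max overshoot: {max(results):.2f} ms")
```

On an idle desktop the overshoot is usually small, but nothing in the interface promises that. A consistent overshoot of tens of milliseconds would look exactly like the grey line in the stutter region.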
At this point I was saved by another engineer at the chip vendor, who discovered a bug that had already been fixed in the next version of Android, named Marshmallow. The A