Low Latency FPV Streaming with the Raspberry Pi

One of the biggest challenges in running my local quadcopter racing group has been overlapping video channels. The manual for the transmitters list a few dozen channels, but in practice, only four or five can be used at the same time without interference.

The transmitters being used right now are analog. Digital video streams could make much more efficient use of the spectrum, but this can introduce latency. Of late, I’ve been noticing a few posts around /r/raspberry_pi about how to do an FPV stream with an RPi, and I’ve been doing some experiments along these lines, so I thought it was a good time to share my progress.

It’s tempting to jump right into HD resolutions. Forget about it; it’s too many pixels. Fortunately, since we’re comparing to analog FPV gear, we don’t need that many pixels to be competitive. The Fatshark Dominator V3s are only 800×600, and that’s a $600 set of goggles.

You’ll want to disable wireless power management. This will tend to take the wireless interface up and down a lot, introducing a delay each time. It’s not saving you that much power; consider that an RPi takes maybe 5 watts, while on a 250-sized quadcopter, the motors can easily take 100 watts or more each. So shut that off by adding this anywhere in /etc/network/interfaces:

wireless-power off

And reboot. That should take care of that. Check the output of iwconfig to be sure. You should see a line that says “Power Management:off”.

You’ll want to install GStreamer 1.0 with the rpicamsrc plugin. This lets you take the images directly off the RPi camera module, without having to use shell pipes to have raspivid to go into GStreamer, which would introduce extra lag.

With GStreamer and its myriads of plugins installed, you can start this up on the machine that will show the video:

gst-launch-1.0 udpsrc port=5000 \
    ! gdpdepay \
    ! rtph264depay \
    ! avdec_h264 \
    ! videoconvert \
    ! autovideosink sync=false

This will listen on UDP port 5000, waiting for an RTSP h.264 stream to come in, and then automatically display it by whatever means works for your system.

Now start this on the RPi:

gst-launch-1.0 rpicamsrc bitrate=1000000 \
    ! 'video/x-h264,width=640,height=480' \
    ! h264parse \
    ! queue \
    ! rtph264pay config-interval=1 pt=96 \
    ! gdppay \
    ! udpsink host=[INSERT_IP_ADDR] port=5000

Modify that last line to have the IP address of the machine that’s set to display the stream. This starts grabbing 640×480 frames off the camera with h.264 encoding, wraps them up in the RTSP protocol, and sends them out.

On a wireless network with decent signal and OK ping times (80ms average over 100 pings), I measured about 100ms of video lag. I measured that by displaying a stop watch on my screen, and then pointing the camera at that and taking a screenshot:


This was using a RPi Model 2, decoding on a fairly modest AMD A8-6410 laptop.

I’d like to tweak that down to the 50-75ms range. If you’re willing to drop some security, you could probably bring down the lag a bit by using an open WiFi network.

I’ll be putting together some estimates of bandwidth usage in another post, but suffice it to say, a 640×480@30fps stream comes in under 2Mbps with decent quality. There will be some overhead on that for things like frame and protocol headers, but that suggests a 54Mbps wireless connection will take over 10 people no problem, and that’s on just one WiFi channel.

The Performance of Seperating Control and Video Display Processes in UAV::Pilot

I’ve been running some crude benchmarks of the UAV::Pilot video timing. As I went over in my last post, I’m planning on having the video be read from the network in one process, and have it piped out to another process for decoding and display.

I added logging statements that show the exact time (using Time::HiRes::gettimeofday()) that a video packet comes in, and then another log for when we display it on the SDL window.

The first benchmark used the existing uav_video_display that’s in the UAV::Pilot distribution, reading from the file ardrone_video_stream_dump.bin. This file is in the UAV::Pilot::ARDrone distribution and is a direct dump of the network stream from an AR.Drone’s h.264 video port. It’s primarily used to run some of the video parsing tests in that distro.

I found that on my laptop, there was a delay of 12.982ms between getting the video frame and actually displaying it. At 60fps, there is a delay of 16.667ms between each frame, so this seems quite acceptable. The AR.Drone only goes up to 30fps, anyway, but it’s nice to know we have some leeway for future UAVs.

I then implemented a new script in the UAV::Pilot::ARDrone distro that read the same video frames from STDIN. I had planned on doing this with the same file noted above, like this:

cat ardrone_video_stream_dump.bin | ardrone_display_video.pl

But this ended up displaying only the last frame of video.

My theory on why this happens is that we use AnyEvent for everything, including reading IO and telling SDL when to display a new window. Using cat like that, there’s always more data for the AnyEvent->io watcher to grab, so SDL never gets a chance until the pipe is out of data. At that point, it still has the last frame in memory, so that’s what it displays.

I tried playing around with dd instead of cat, but got the same results.

So I broke down and connected to the actual AR.Drone with nc:

nc 5555 | ardrone_display_video.pl

Which did the trick. This does mean that the results are not directly comparable to each other. We can still run the numbers and make sure the delay remains insignificant, though.

And indeed it did. It averaged out to 13.025ms. That alleviates my concern that using a pipe would introduce a noticeable delay and things can go right ahead with this approach.

Thinking out Loud: Managing Video and Nav Display Together in UAV::Pilot

I’ve been going back into the various UAV::Pilot distros and trying to figure out how to best approach putting video and nav data together. Ideally, navigation would overlay information directly, with a standalone nav display perhaps being an option.

That doesn’t work, because the video uses a YUV overlay in SDL to spit all the pixels to screen at once. Because of whatever hardware magic SDL does to make this work, drawing on top of those pixels has a nasty flicker effect. SDL2 might solve this, but there hasn’t been much movement on the Perl bindings in a number of months.

Using OpenGL to use the YUV video data as a texture might also solve this, and I suspect it’s the way to go in the long term. Trouble is, Perl’s OpenGL docs are lacking. They seem to assume you already have a solid grounding in how to use OpenGL in C, and you just want to move over to Perl. I messed with OpenGL ES on Android (Java) a while back, but I’m more or less starting fresh. Still, working through an OpenGL book in C might be a good exercise, and then I can revisit this in Perl.

(If anybody else wants to take up the OpenGL stuff, I would wholeheartedly endorse it.)

It would be nice if SDL let you create two separate windows in the same process, but it doesn’t seem to like that.

The trick that’s already implemented in the full releases is to take an SDL window and subdivide it into different drawing areas. This meant implementing a half-assed layout system. It also ended up breaking in the last release, as I called SDL::Video::update_rect() on the whole window, which caused nasty visual issues with the YUV overlay.

That part is fixed now by only updating parts of the layout that want to be updated. Now the problem is that displaying the nav and video together causes a half-second or so lag in the video. This is unacceptable in what should be a real-time output.

I think the way to go will be to fork() off and display the video and nav in separate processes. The central process will manage all incoming data from the video and nav network sockets, and then pipe it to its children. Then there are separate SDL windows in each process. The UAV::Pilot::SDL::Window interface (that half-assed layout system) will probably still be implemented, but will be effectively vestigial for now.

This might mean parsing the nav stream redundantly in both the master process and the nav process. There are still things in the master process that would need the nav data. But it’s probably not a big deal.

It’ll also mean all the video processing can be done on a separate CPU core, so that’s cool.

Another benefit: currently, closing the SDL window when using the uav shell will exit the whole shell. There are probably some SDL parameters I could play with to fix this, but with separate processes, this is no longer a problem.

How UAV::Pilot got Real Time Video, or: So, Would You Like to Write a Perl Media Player?

Real-time graphics isn’t something people normally do in Perl, and certainly not video decoding. Video decoding is too computation-intensive to be done in pure Perl, but that doesn’t stop us from interfacing to existing libraries, like ffmpeg.

The Parrot AR.Drone v2.0 has an h.264 video stream, which you get by connecting to TCP port 5555. Older versions of the AR.Drone had its own encoding mechanism, which their SDK docs refer to as “P.264”, and which is a slight variation on h.264. I don’t intend to implement the older version. It’s for silly people.

Basics of h.264

Most compressed video works by taking an initial “key frame” (or I-frame), which is the complete data of the image. This is followed by several “predicted frames” (or P-frame), which hold only the differences compared to the previous frame. If you think about a movie with a simple dialog scene between two characters, you might see a character on camera not moving very much except for their mouth. This can be compressed very efficiently with a single big I-frame and lots of little P-frames. Then the camera switches to the other character, at which point a good encoder will choose to put in a new I-frame. You could technically keep going with P-frames, but there are probably too many changes to keep track of to be worth it.

Since correctly decoding a P-frame depends on getting all the frames back to the last I-frame right, it’s a good idea for encoders to throw in a new I-frame on a regular basis for error correction. If you’ve ever seen a video stream get mangled for a while and then suddenly correct itself, it’s probably because it hit a new I-frame.

(One exception to all this is Motion JPEG, which, as the name implies, is just a series of JPEG images. These tend to have a higher bitrate than h.264, but are also cheaper to decode and avoid having errors affect subsequent frames.)

If you’ve done any kind of graphics programming, or even just HTML/CSS colors, then you know about the RGB color space. Each of the Red, Green, and Blue channels gets 8 bits. Throw in an Alpha (transparency) channel, and things fit nice into a 32 bit word.

Videos are different. They use the “YCbCr” color space, at term which is sometimes used interchangeably with “YUV”. The “Y” is luma, while “Cb” and “Cr” is blue and red, respectively. There are bunch of encoding variations, but the most important one for our purposes is YUV 4:2:2.

The reason this is that YUV can do a clever trick where it sends the Y channel on every pixel on a row, but only sends the U and V channels on every other pixel. So where RGB has 24 bits per pixel (or 32 for RGBA), YUV averages to only 16.

The h.264 format internally stores things in YUV 4:2:2, which corresponds to SDL::Overlay‘s flag of SDL_YV12_OVERLAY.

Getting Data From the AR.Drone

As I said before, the AR.Drone sends the video stream over TCP port 5555. Before getting the h.264 frame, a “PaVE” header is sent. The most important information in that header is the packet size. Some resolution data is nice, too. This is all processed in UAV::Pilot::Driver::ARDrone::Video.

The Video object can take a list of objects that do the role UAV::Pilot::Video::H264Handler. This role requires a single method to be implemented, process_h264_frame(), which is passed the frame and some width/height data.

The first object to do that role was UAV::Pilot::Video::FileDump, which (duh) dumps the frames to a file. The result could be played on VLC, or encoded into an AVI with mencoder. This is as far as things got for UAV::Pilot version 0.4.

(In theory, you should have been able to play the stream in real time on Unixy operating systems by piping the output to a video player that can take a stream on STDIN, but it never seemed to work right for me.)

Real Time Display

The major part of version 0.5 was to get the real time display working. This meant brushing up my rusty C skills and interfacing to ffmpeg and SDL. Now, SDL does have Perl bindings, but they aren’t totally suitable for video display (more on that later). There are also two major bindings to ffmpeg on CPAN: Video::FFmpeg and FFmpeg. Neither was suitable for this project, because they both rely on having a local file that you’re processing, rather than having frames in memory.

Fortunately, the ffmpeg library has an excellent decoding example. Most of the xs code for UAV::Pilot::Video::H264Decoder was copy/pasted from there.

Most of that code involves initializing ffmpeg’s various C structs. Some of the most important lines are codec = avcodec_find_decoder( CODEC_ID_H264 );, which gets us an h.264 decoder, and c->pix_fmt = PIX_FMT_YUV420P;, which tells ffmpeg that we want to get data back in the YUV 4:2:2 format. Since h.264 stores in this format internally, this will keep things fast.

In process_h264_frame(), we call avcodec_decode_video2() to decode the h.264 frame and get us the raw YUV array. At this point, the YUV data is in C arrays, which are nothing more than a block of memory.

High-level languages like Perl don’t work on blocks of memory, at least not in ways that the programmer is usually supposed to care about. They hold variables in a more sophisticated structure, which in Perl’s case is called an ‘SV’ for scalars (or ‘AV’ for array, or ‘HV’ for hashes). For details, see Rob Hoelz’s series on Perl internals, or read perlguts for all the gory details.

If we wanted to process that frame data in Perl, we would have iterate through the three arrays (one for each YUV channel). As we go, we would put the content in an SV, then push that SV onto an AV. Those AVs can then be passed back from C and into Perl code. The function get_last_frame_pixels_arrayref() handles this conversion, if you really want to do that. Protip: you really don’t want to do that.

Why? Remember that YUV sends Y for every pixel in a row, and U and V for every other pixel, for an average of 2 bytes per pixel, and therefore 2 SVs per pixel (again, on average). If we assume a resolution of 1280×720 (720p), then there are 921,600 pixels, or 1,843,200 SVs to create and push. You would need to do this 25-30 times per second to keep up with a real time video stream, on top of the video decoding and whatever else the CPU needs to be doing while controlling a flying robot.

This would obviously be too taxing on the CPU and memory bandwidth. My humble laptop (which has a AMD Athlon II P320 dual-core CPU) runs up to about 75% CPU usage in UAV::Pilot while decoding a 360p video stream. That laptop is starting to show its age, but it’s clear that the above scheme would not work even on newer and beefier machines.

Fortunately, there’s a little trick that’s hinted at in perlguts. The SV struct is broken down into more specific types, like SViv. The trick is that the IV type is guaranteed to be big enough to store a pointer, which means we can store a pointer to the frame data in an SV and then pass it around in Perl code. This means that instead of 1.8 million SVs, we make just one for holding a pointer to the frame struct.

This trick is pretty common in xs modules. If you’ve ever run Data::Dumper on a XML::LibXML node, you may have noticed that it just shows a number. That number is actually a memory address that points to the libxml2 struct for that particular DOM node. The SDL bindings also do this.

The tradeoff is that the data can never be actually processed by Perl, just passed around between one piece of C code to another. The method get_last_frame_c_obj() will give you those pointers for passing around to whatever C code you want.

This is why SDL::Overlay isn’t exactly what we need. To pass the data into the Perl versions of the overlay pixels() and pitches() methods, we would have to do that whole conversion process. Then, since the SDL bindings are a thin wrapper around C code, it would undo the conversion all over again.

Instead, UAV::Pilot::SDL::Video uses the Perl bindings to initialize everything in Perl code. Since SDL is doing that same little C pointer trick, we can grab the SDL struct for the overlay the same way. When it comes time to draw the frame to the screen, the module’s xs code gets the SDL_Overlay C struct and feeds in the frame data we already have. Actual copying of the data is done by the ffmpeg function sws_scale(), because that’s solution I found, and I freely admit to cargo-culting it.

At this point, it all worked, I jumped for joy, and put the final touches on UAV::Pilot version 0.5.

Where to go From Here

I would like to be able to draw right on the video display, such as to display nav data like the one in this video:


Preliminary work is done in UAV::Pilot::SDL::VideoOverlay (a role for objects to draw things on top of the video) and UAV::Pilot::SDL::VideoOverlay::Reticle (which implements that role and draws a reticule).

The problem I hit is that you can’t just draw on the YUV overlay using standard SDL drawing commands for lines or such. They come up black and tend to flicker. Part of the reason appears to go back to YUV only storing the UV channels on every other pixel, which screws up 1-pixel wide lines 50% of the time. The other reason is that hardware accelerated YUV overlays are rather complicated. Notice that linked discussion thread goes back to 2006, and things don’t appear to have gotten better until maybe just recently with the release of SDL2.

The video frame could be converted to RGB in software, but that would probably be too expensive in real time. The options appear to be to either work it out with SDL2, or rewrite things in OpenGL ES. OpenGL would add a lot more boilerplate code, but could have side benefits for speed on top of just plain working correctly.

Once you can draw on the screen, you could do some other cool things like doing object detection and displaying boxes around those objects. Image::ObjectDetect is a Perl wrapper around the opencv object detection library, though you’ll run into the same problem of copying SVs shown above. Best to use the opencv library directly.