Reprojecting the Perseverance landing footage onto satellite imagery

Created by
Matthew Earl
on March 06, 2021.

Header pic


The landing of the Mars 2020 Perseverance rover last month amazed the world.
The stunning footage of the descent shows each stage of the sequence.
If you have not seen it already
you can watch it here.

One thing that I found remarkable was the self-similarity of the Martian
terrain. As the lander descends towards the ground it is hard to get a
sense of scale, since there is no familiar frame of reference to tell us how far
away the ground is. This led me to embark on a project in which I
reproject the footage onto a satellite image obtained from the
Mars Express Orbiter,
along with a scale to tell us how large features on the ground actually are:

In this post I'm going to explain how I used Python, OpenCV, PyTorch, and
Blender to make the above footage.

Keypoints and correspondences

Producing my video involves distorting the frames of the original footage so
that each frame lines up with the satellite image. The usual way of doing this
is to:

  • extract some salient keypoints from each image
  • find correspondences between the points
  • find a mathematical function that maps points in the first image to those in the second image.

The details of implementing the above are described in
this OpenCV tutorial, but I will summarize the process here.

Breaking this down, on the left is a frame from the video that we want to align,
with the reference satellite image on the right:

images to align

Firstly, we use OpenCV's
Scale Invariant Feature Transform (SIFT) keypoint
detector to pull out salient keypoints from the image:

images with keypoints indicated

Each red cross here marks a potentially "interesting" point as determined by the
SIFT algorithm. Associated with each point (but not shown) is a vector of 128
values which describes the part of the image that surrounds the keypoint. The
idea is that this descriptor is invariant to things like scale (as the name
implies), rotation, and lighting differences. We can then match up points in our
pair of images with similar descriptors:

images with lines between corresponding keypoints
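
As a rough sketch of the matching step, descriptors can be paired by nearest-neighbour search. The ratio test used below (keep a match only when the best candidate is clearly better than the runner-up) is a common filtering heuristic and an assumption on my part, not something this post specifies. `match_descriptors` and the toy 3-value descriptors are hypothetical stand-ins for real 128-value SIFT descriptors.

```python
import math

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs where desc_b[j] is the nearest neighbour of desc_a[i].

    A pair is kept only if the nearest distance is below `ratio` times the
    second-nearest distance (an assumed heuristic, not taken from the post).
    """
    matches = []
    for i, a in enumerate(desc_a):
        # Distances from this descriptor to every descriptor in the other image.
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches

# Toy descriptors: the third one is ambiguous and should be rejected.
frame_desc = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 0.5, 0.5)]
sat_desc = [(0.0, 0.9, 0.1), (0.9, 0.1, 0.0), (0.0, 0.0, 1.0)]
matches = match_descriptors(frame_desc, sat_desc)
```

Ambiguous descriptors, such as those on repetitive terrain, are dropped rather than matched to an arbitrary candidate.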

Projective transformations

Now that we have found the keypoint pairs, the next step is to find a transformation
that maps the keypoints from the video frame onto the corresponding keypoints of
the satellite image. To do this, we make use of a class of transformations known
as projective transformations. Projective transformations can be used to
describe how fixed points on a flat plane change apparent position when viewed
from a different location and angle, which is useful to us since the surface of Mars can be well
approximated by a flat plane at these distances. This assumes that the camera conforms to a rectilinear perspective projection (i.e. without lens distortion), which appears to be the case.

A projective transformation is represented by a 3×3 matrix \(M\). To apply such a
transformation to a 2D point \(v\) we first append a 1 to give a 3-vector, then
multiply by the matrix:

\[v' = M \begin{bmatrix}
v_x \\
v_y \\
1
\end{bmatrix}\]

To get back to a 2D point, the result is divided by its
third element, and truncated back to a 2-vector:

\[\begin{bmatrix}
v'_x / v'_z \\
v'_y / v'_z
\end{bmatrix}\]

This can be visualized by
plotting the points on the z=1 plane, applying the transformation, and then
projecting each point towards the origin, back onto the z=1 plane.
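
The two steps above (append a 1 and multiply, then divide by the third element) can be sketched in a few lines of plain Python. `apply_projective` is a hypothetical helper name and the matrix values are made up purely for illustration.

```python
def apply_projective(M, v):
    """Apply the 3x3 matrix M (nested lists) to the 2D point v."""
    x, y = v
    homogeneous = (x, y, 1.0)  # append a 1 to give a 3-vector
    vx, vy, vz = (sum(m * h for m, h in zip(row, homogeneous)) for row in M)
    return (vx / vz, vy / vz)  # divide by the third element, keep two values

# Illustrative matrix: a shift in x plus a perspective term in the last row.
M = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.25, 1.0]]
point = apply_projective(M, (3.0, 4.0))
print(point)  # (2.5, 2.0)
```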

When we talk about composing projective transformations, what we are actually doing is
multiplying the underlying matrices: projective transformations have the property
that the composition of two transformations is the same as the projective transformation
given by the matrix product of their respective matrices. Written symbolically:

\[\forall x \in \mathbb{R}^2 \colon p_{M_1} ( p_{M_2} (x) ) =
p_{M_1 M_2} (x)\]

where \(p_{M}\) denotes the projective transformation associated with the 3×3
matrix \(M\).
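
A quick numerical check of this property, as a sketch: applying two transformations one after the other should give the same point as applying the product of their matrices once. The helper names and the two example matrices are hypothetical.

```python
def apply_projective(M, v):
    """Apply the 3x3 matrix M to the 2D point v (append 1, multiply, divide)."""
    h = (v[0], v[1], 1.0)
    vx, vy, vz = (sum(m * c for m, c in zip(row, h)) for row in M)
    return (vx / vz, vy / vz)

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

M1 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.25, 1.0]]
M2 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
x = (3.0, 4.0)

nested = apply_projective(M1, apply_projective(M2, x))  # p_{M1}(p_{M2}(x))
combined = apply_projective(matmul(M1, M2), x)          # p_{M1 M2}(x)
```

Note the order: the matrix applied first (`M2`) sits on the right of the product.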

Finding the transformation is done using a
RANSAC approach.
For more details on RANSAC please see
my Pluto flyby post.

Once we have a transformation for each frame, we can reproject each video frame
into the frame of reference of the satellite image, thus obtaining the stabilized
view.

Finding transformations for each frame

Unfortunately it is not simply a case of repeating the above process for each
frame in order to produce a complete video, because the algorithm is not able to
find enough correspondences for every frame.

In order to solve this, we also look for transformations between the video frames
themselves. The idea is that if a frame has no direct transformation linking
it to a satellite image, but we do have a transformation linking it to another
frame that is itself linked to the satellite image, then we can simply
compose the two transformations to map the original frame onto the satellite view.

So, I labelled every thirtieth frame (i.e. one frame per second) as a "keyframe",
and then exhaustively searched for transformations between every pair of
keyframes. For the remaining frames I searched for transformations to the
nearest keyframe.
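
The set of frame pairs to try can be enumerated as in the sketch below. `candidate_pairs` is a hypothetical helper. It assumes keyframes start at frame 0 and that ties go to the earlier keyframe, and it only lists the pairs to search, since not every pair will actually yield a transformation.

```python
from itertools import combinations

def candidate_pairs(num_frames, keyframe_interval=30):
    """List the frame pairs between which a transformation is searched for."""
    keyframes = list(range(0, num_frames, keyframe_interval))
    # Exhaustive search: every pair of keyframes.
    pairs = list(combinations(keyframes, 2))
    # Every other frame is only paired with its nearest keyframe.
    for frame in range(num_frames):
        if frame % keyframe_interval != 0:
            nearest = min(keyframes, key=lambda k: abs(k - frame))
            pairs.append((frame, nearest))
    return pairs

# Small example: 11 frames with a keyframe every 5 frames.
pairs = candidate_pairs(num_frames=11, keyframe_interval=5)
```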

This results in a fairly dense graph with one node per frame, and one edge per
transformation found. Here is a simplified example, with keyframes at every 5
frames rather than at every 30:

graph showing connections between frames

Any path from the satellite node to a particular frame's node represents a series
of transformations that when composed will map the frame onto the satellite
view.

We can start by selecting one path for each node. Doing a breadth-first
search from the satellite node will give us a path to each frame while also
guaranteeing that it is the shortest possible:

previous graph but with shortest paths highlighted

We want the shortest path possible, because small errors accumulate with each
extra transformation.
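
Here is a minimal sketch of that breadth-first search, composing matrices along the way. The conventions are assumptions made for the example: `edges[(a, b)]` holds a 3×3 matrix mapping points in node `b` onto node `a`, edges are listed in the direction they will be traversed, and the satellite node is named `"sat"`. The result maps each frame onto the satellite image, which is the inverse direction of the `frame_transforms` used later in the post.

```python
from collections import deque

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def shortest_path_transforms(edges, root="sat"):
    """Compose transformations along breadth-first paths from the root."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    to_root = {root: identity}
    queue = deque([root])
    while queue:
        a = queue.popleft()
        for b in neighbours.get(a, []):
            if b not in to_root:  # first visit uses the fewest hops
                to_root[b] = matmul(to_root[a], edges[(a, b)])
                queue.append(b)
    return to_root

def shift(dx, dy):
    """Translation matrix, used here only to make the result easy to read."""
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]

edges = {("sat", 0): shift(10, 0), (0, 1): shift(1, 0),
         (1, 2): shift(1, 0), (0, 2): shift(2, 1)}
to_sat = shortest_path_transforms(edges)
```

Frame 2 is reached through the direct edge from frame 0 (two hops from the satellite node) rather than through frame 1 (three hops).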

Here is a short clip made using shortest path transformations:


While the above method yields a decent reprojection, it is not perfect. There
are clear mode switches around the points where the shortest path changes.

If we incorporate all correspondences, and not just those on the shortest path,
this gives more information and results in a smoother and more accurate
reprojection.

To do this, I wrote a loss function which returns the total reprojection error,
given a satellite-relative transformation for each image:

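    import torch

    def project(v):
        # Project onto the plane z=1 by dividing by the last component.
        return v / v[..., -1][..., None]

    def loss(frame_transforms, src_pts, dst_pts, src_idx, dst_idx):
        # One matrix per point-pair: source frame -> satellite, satellite -> destination frame.
        M_src_inv = torch.inverse(frame_transforms)[src_idx]
        M_dst = frame_transforms[dst_idx]
        # Map the source points into the satellite image's frame of reference...
        ref_pts = torch.einsum('nij,nj->ni', M_src_inv, src_pts)
        # ...then into the destination frame, normalizing back onto z=1.
        reprojected_dst_pts = project(torch.einsum('nij,nj->ni', M_dst, ref_pts))

        return torch.dist(reprojected_dst_pts, dst_pts)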

src_pts and dst_pts are both N x 3 arrays, representing every pair of
points in the dataset. frame_transforms is an M x 3 x 3 array representing
the candidate transformations, M being the number of frames in the video.
frame_transforms are relative to the satellite image, which is to say a point
in the satellite image when transformed with frame_transforms[i] should
give the corresponding point in frame i.

Since there are many point-pairs per frame, src_idx and dst_idx are used
to map each half of each point-pair to the corresponding video frame.

The loss function proceeds by taking the first points from each pair, mapping
them back into the satellite image's frame of reference, then mapping them into
the frame of reference of the second image. With accurate frame transformations and
perfect correspondences, these transformed points should be very close to the
corresponding set of second points. The final line of the loss function then
measures the Euclidean distance (the square root of the sum of squared differences) between the reprojected first
points and the (unmodified) second points. The idea is that if we find a set of
frame_transforms with a lower loss, then we will have a more accurate set of
transformations.
loss is written using Torch. Torch is an automatic
differentiation framework with functionality for applying gradient
descent (amongst other things).
As such we can use it to iteratively improve our frame_transforms:

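    src_pts, dst_pts, src_idx, dst_idx = dataset
    # frame_transforms must be a tensor with requires_grad=True to be optimized.
    frame_transforms = initial_frame_transforms

    optim = torch.optim.Adam([frame_transforms], lr=1e-5)
    while True:
        optim.zero_grad()
        l = loss(frame_transforms, src_pts, dst_pts, src_idx, dst_idx)
        l.backward()   # gradients of the loss with respect to frame_transforms
        optim.step()   # nudge the transforms towards a lower loss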

dataset is constructed from the set of correspondences, and the
initial_frame_transforms are those derived from composing the transformations
along the shortest paths.

After running this loop for a while we obtain the final set of transformations,
one for each frame, which gives a more accurate reprojection:


To produce the final video I used the 3D modelling and rendering application
Blender. I used Blender's
rich Python scripting interface to animate a quad whose corners follow the
reprojected video's corners. To get the right texture for the quad I took
advantage of Blender's shader system:

screenshot of blender shader

In general, the shader system decides how a particular point on a surface should
be shaded, which is typically a function of incoming light, view direction, and
properties of the surface. Here I'm using it in a very simple way which
calculates what colour the point on the quad should be, given the point's
coordinates in 3D space.

Here is a breakdown of the different stages:

  1. Take the location of the point to be coloured, and replace the Z component
    with a 1. This is the first stage of the projective transformation where we
    turn the 2-vector into a 3-vector by appending a one.
  2. Multiply this 3-vector by a matrix defined by the constants shown here. These
    constants are actually animated so that on any given frame they correspond to
    the projective transformation for that frame.
  3. Divide through by z (project onto the z=1 plane).
  4. At this point the coordinates are in terms of pixels in the video frame.
    However the next stage needs them to be in the range 0 to 1, so divide by the
    video width and height here.
  5. Look up the given coordinates in the video, and output the corresponding
    colour.
Final touches

There are a few extra points that needed addressing to produce the final video:

  • I used many satellite images rather than just one. However, I designate one
    as the "reference frame" (i.e. the frame with the identity transformation) and
    treat the rest as if they were video keyframes.
  • During the early part of the video, the rover's heatshield is visible. Without
    intervention, some frame correspondences track the heatshield (which is itself
    moving) rather than the terrain, causing poor tracking. So, I manually
    extracted some keypoints from the heatshield on a particular frame, and
    ignored all keypoints that were similar to at least one of the heatshield's
    keypoints.
  • Occasionally, degenerate frame correspondences are found. When all matching
    keypoints are in a line you get multiple solutions corresponding to rotations
    about that line. Even if matching keypoints are not exactly in a line but are
    close, the transformation found can be inaccurate. There was one such image
    pair that caused this problem in my video, which I manually excluded.
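
The degenerate case in the last point was handled by hand in this project, but one possible automatic check is sketched below, purely as an illustration. It compares the two eigenvalues of the keypoints' covariance matrix: if the spread perpendicular to the main direction is tiny, the points are close to a line. `is_nearly_collinear` and its threshold are hypothetical and untuned.

```python
import math

def is_nearly_collinear(points, threshold=1e-3):
    """Return True if the 2D points lie (almost) on a single line."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest, smallest = half_trace + root, half_trace - root
    if largest == 0:
        return True  # all points coincide
    return smallest / largest < threshold

line_like = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.001)]
spread_out = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```

A frame pair whose matched keypoints fail this kind of test could then be skipped before estimating a transformation.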


I have shown that the footage from the Perseverance rover's descent can be
stabilized and aligned with a reference satellite image. While I'm quite happy with
the result and it certainly helps give context to the raw footage, there are
many ways that it could be improved. For example, during the early part of the
video there are not many keypoints found by SIFT. This manifests itself as
inaccuracy in the tracking. Perhaps experimenting with different keypoint
algorithms would yield more usable keypoints.

There may also be alternative ways to solve the problem which I have not
explored here. For example, the problem is somewhat similar to that of general
video stabilization. Perhaps I could use an off-the-shelf solver to achieve a
similar effect.
