
Datamoshing


I once saw a very intriguing YouTube video: Les peintures numériques de Jacques Perconte, on Tracks, a show produced by the Franco-German television channel Arte and dedicated to emergent forms of art. In the video, the French artist Jacques Perconte exhibits his work, which relies on datamoshing. He encodes videos in such a way that he loses data without corrupting the file, resulting in mesmerizing video artifacts. And he plays with that to merge several pictures in evocative ways.

Extracted from Avant l'effondrement du Mont Blanc (2020)

I found that amazing, and wanted to try it myself. So first, I tried to replicate it.

A First Attempt at Compressionism

To get an understanding of the principle, one must know a little bit about video encoding.

A Quick Introduction to Video Encoding

Basically, a video is a sequence of still images, called frames. You can take a series of JPGs and concatenate them into a video file, for instance, using FFmpeg:

ffmpeg -framerate 30 -i frame_%03d.jpg -c:v libx264 -pix_fmt yuv420p output.mp4

But storing each individual frame of a video is an insanely inefficient solution. For a one-minute HD video at 30 frames per second and 8-bit color depth, the file size would be more than 10 GB. Instead, video files are compressed, using the fact that in a real video, the next frame remains close to the previous one. So, if you know one frame, you just have to store the difference with the next one, and re-compute it on the fly when playing, and then do the same for the next frame, and so on. When the scene changes, you can store a new reference frame, and compute differences from that new starting point.
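
As a quick back-of-the-envelope check of that figure (assuming 3 bytes per pixel, one byte per color channel):

# One minute of uncompressed 1080p video at 30 frames per second
width, height = 1920, 1080
bytes_per_frame = width * height * 3     # 3 color channels, 1 byte each
total_bytes = bytes_per_frame * 30 * 60  # 30 fps for 60 seconds
print(total_bytes / 10**9)               # about 11.2 GB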

So here is where video compression picture types come into play. There are three types of frames:

  - I-Frames (intra-coded frames): complete, self-contained images, used as references
  - P-Frames (predicted frames): only store differences from previous frames
  - B-Frames (bidirectional predicted frames): store differences from both previous and following frames

You can list the picture types of a video using FFprobe with this command (it outputs JSON data):

ffprobe -v quiet -pretty -print_format json -show_entries "format=size,bit_rate:frame=coded_picture_number,pkt_pts_time,pkt_pts,pkt_dts_time,pkt_dts,pkt_duration_time,pict_type,interlaced_frame,top_field_first,repeat_pict,width,height,sample_aspect_ratio,display_aspect_ratio,r_frame_rate,avg_frame_rate,time_base,pkt_size" -select_streams v:0 video.mp4
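
For instance, here is a small Python sketch (the file name is a placeholder) that asks ffprobe only for the picture types and counts them:

import json
import subprocess
from collections import Counter

# Ask ffprobe for the picture type of each frame of the first video stream
result = subprocess.run([
    "ffprobe", "-v", "quiet", "-print_format", "json",
    "-show_entries", "frame=pict_type",
    "-select_streams", "v:0", "video.mp4",
], capture_output=True, text=True, check=True)

frames = json.loads(result.stdout)["frames"]
print(Counter(frame["pict_type"] for frame in frames))
# prints something like Counter({'P': ..., 'B': ..., 'I': ...})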

Of course, video encoding algorithms are a lot more complex, but this is a simple way of understanding their basic behavior. For instance, this explains why you can not seek freely within a video file: the decoder needs to parse the nearby frames first to compute the one you are looking for.

During this reconstruction, encoders partly rely on pixel motion within the frame:

Motion vectors of P and B-Frames of a basketball rolling on a white background (details in FFmpeg's documentation)

Again, you can see this with FFmpeg:

ffplay -flags2 +export_mvs video.mp4 -vf codecview=mv=pf+bf+bb

The whole point of compressionism is to take two clips, and use the P and B frames of the second clip with the I-Frames of the first clip as reference. The image remains similar to the first clip, but you start to see the artifacts from the second one. Typically, the motion vectors of the second clip are applied to the pixels of the first one. This is really clever!

Here is a good read if you want to know more about this: Back to basics: GOPs explained, by Bryan Samis.

Industrial Espionage

In the Tracks video, at 2:51, the camera shows Jacques Perconte's screen with a Ruby script. I've transcribed it on GitHub. Basically, it prepares arguments for an ffmpeg command. Here are some of those arguments explained, according to the FFmpeg documentation:

Argument       Value       Description
-vf scale      5760:3240   set the output size
-vf flags      neighbor    use the nearest-neighbor algorithm for scaling
-c:v           libxvid     use the Xvid video encoder
-b:v           211277k     set the video bitrate (good quality)
-sc_threshold  0           threshold for scene change detection (default: 40)
-g             20          keyframe interval, also known as GOP (group of pictures) length: the maximum distance between I-Frames (default: 250)
-me_method     zero        motion estimation method; zero means no motion estimation
-c:a           libmp3lame  use the LAME MP3 audio encoder
-b:a           256k        set the audio bitrate (good quality)

It seems that the idea is to artificially increase the frequency of I-Frames (by decreasing the scenecut threshold and the GOP length). I guess this is in preparation for a second step (which is not shown in the Tracks video): dropping I-Frames. Making I-Frames frequent will increase the glitchy look of the video once they are removed.

Yet, we'll see later that such settings might not be enough for our needs.

Dropping Reference Frames

The plan is simple:

  1. Take a video file
  2. Split it into its frames
  3. Identify their types
  4. Delete I-Frames
  5. Rebuild the video file

And voilà! Hum, actually, one does not simply drop I-Frames. Current implementations of encoders are the result of decades of advanced research. So get ready for a dive into technical implementation details.

I chose to work with the h264 encoder, for no specific reason, but I had to start somewhere. I searched online for what was inside an MP4 file. It is a container format, made of sections identified by a 4-byte word. Sections starting with mdat contain the actual data. For an h264 video, this section contains an h264 stream.

This stream is a sequence of Network Abstraction Layer (NAL) Units. Each unit contains either metadata (such as encoding parameters) or frame slices. They start with a header, which we can parse to determine the unit's purpose. I found a very handy guide for this process: Introduction to H.264: (1) NAL Unit, by Yumi Chan. The header contains a nal_unit_type field, encoded as a 5-bit integer. Reference frames are of type 5. Interpolated frames are of type 1. The remaining types aren't very interesting for us.

Notice that here, I do not mention I, P or B frames anymore. The documentation mentions "IDR" or "non-IDR" pictures, which means "(non-) Instantaneous Decoding Refresh" pictures. I was neither willing nor able to dig deeper. My plan was simply to drop all but the first IDR frame, and rebuild the h264 stream.
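
In essence, the plan boils down to something like the following sketch (not my actual implementation; it assumes the stream was extracted to an Annex B file, with 0x000001 start codes separating the NAL units, and it ignores details such as frames being split across several slices):

# Keep every NAL unit, except IDR slices (type 5) after the first one
def drop_idr_frames(stream: bytes) -> bytes:
    units = stream.split(b"\x00\x00\x01")
    output = [units[0]]  # anything before the first start code
    seen_idr = False
    for unit in units[1:]:
        if not unit:
            output.append(unit)
            continue
        nal_unit_type = unit[0] & 0x1F  # lower 5 bits of the header byte
        if nal_unit_type == 5:  # coded slice of an IDR picture
            if seen_idr:
                continue  # drop every IDR frame but the first
            seen_idr = True
        output.append(unit)
    return b"\x00\x00\x01".join(output)

with open("video.h264", "rb") as file:
    data = file.read()
with open("moshed.h264", "wb") as file:
    file.write(drop_idr_frames(data))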

So, I implemented that. It's available on GitHub, if you want to use or improve it. Here is how it looks:

Sunrise dive, an example of two concatenated clips processed with my I-Frame dropping algorithm

The result is ok-ish. The moshing effect does work as intended. But you may notice that every once in a while, the video stutters (or jumps). This is because a dropped IDR frame messes up the timings. Apparently, dropping one frame without updating the others creates some issues. It's as if I had to recompute the timing of the following frames, or something like that.

Also, I had a lot of issues with B-Frames being counted as reference frames, which made the stuttering even more violent. B-Frames can be removed using FFmpeg with the -bf 0 argument. The spacing of IDR frames can then be controlled with the -g (maximum spacing) and -keyint_min (minimum spacing) arguments.
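
For reference, re-encoding a clip with those flags before moshing could look like this (just a sketch; the values and file names are examples):

import subprocess

subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-c:v", "libx264",
    "-bf", "0",           # no B-Frames
    "-g", "30",           # at most 30 frames between I-Frames
    "-keyint_min", "30",  # at least 30 frames between I-Frames
    "-an",                # drop the audio to keep the experiment simple
    "prepared.mp4",
], check=True)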

So here I was: I had a working-ish algorithm, but I wanted something a little cleaner.

Exploring Existing Solutions

So I did what I should have done at the very beginning: look for existing solutions online. I mostly looked at one subreddit, r/datamoshing.

Avidemux

Avidemux is an open-source video editor for Linux, Windows and macOS. And weirdly enough, it is mentioned a lot as a solution for datamoshing. The process is similar to the one I used:

  1. Encode a video with specific settings
  2. Drop I-Frames
  3. Export the result

What I do not understand, though, are the specific instructions one must follow to do it. First, you'll need an old version of Avidemux, v2.5.6 (which is not even listed on the release page!). Then,

  1. Load a video into Avidemux. If it asks to use a safe mode, say "No"; if it asks to rebuild frames, say "No"
  2. Under the video export options tab, select "MPEG-4 ASP (Xvid)"
  3. Go to "Configure" and then "Frames", and set "Maximum I-frame Interval" to a large number, and "Maximum Consecutive B-frames" to 0
  4. Export the video as an AVI file
  5. Load the exported video
  6. Manually delete every I-Frame
  7. Export the video a second time

While I can guess why we need to set the maximum consecutive B-frames to 0, I do not understand the rest of the settings. Why Xvid? Jacques Perconte also used it; is there a reason for this? And why set the GOP length to a large number? That differs from Jacques Perconte's settings. I might be misunderstanding some things though, so anyway, let's try it. Here is an example where you can really see how the moshing allows for smooth blending and transitions:

Datamoshing example using Avidemux

On a side note: here is a small AutoHotkey script for automating step 6 (deleting I-Frames):

^f::                      ; triggered by Ctrl+F
Loop 10 {                 ; repeat for the next 10 I-Frames
    Send,{Up}             ; jump to the next keyframe (I-Frame)
    Sleep, 50
    Send,[                ; set the selection start (marker A)
    Sleep, 50
    Send,{Right}          ; step one frame forward
    Sleep, 50
    Send,]                ; set the selection end (marker B)
    Sleep, 50
    Send,{Delete}         ; delete the selection, i.e. the I-Frame
    Sleep, 100
}

It actually deletes the next 10 I-Frames when Ctrl+F is pressed.

MoshUp

Another really popular software I heard about is the Android application MoshUp. Top posts on r/datamoshing were made using this app. And it is very easy to use: you just have to record a video with your phone, pause the recording, and resume it. When paused, you can see an overlay of the current camera view on top of the last frame, allowing you to precisely align the next clip. The cut after resuming is where the moshing happens, probably by simply dropping the first I-Frame of the resumed clip.

Grabbing a vinyl (left) and Library (right), datamoshing examples using MoshUp

Most examples on Reddit show somebody grabbing an item, which gets replaced by another one with a similar shape, sometimes with scaling effects. So I tried to do the same.

Audacity

Alright, yes. You can data mosh videos using Audacity.

Audacity lets you import raw binary files, normally audio files, but you can certainly import anything! If you try to listen to it, it will just sound like a very artsy piece. Once imported, you can apply audio effects to the video, and export it back.

I found this tutorial on the same subreddit. Basically, import, apply effects, and export:

  1. Convert your video into YUV format (.yuv) using FFmpeg (see the sketch after this list); this creates an uncompressed file where pixels are stored in a flat manner, thus enabling some signal processing
  2. In Audacity, go to "File", "Import", "Raw Data", and choose parameters "A-Law" and "Little Endian"
  3. Apply audio effects
  4. Go to "File", "Export Audio", and choose "Save as Type: other uncompressed files", "Header: RAW (header-less)" and "Encoding: A-Law"
  5. Convert the file back using FFmpeg (be sure to match your video's resolution and framerate):
ffmpeg -f rawvideo -vcodec rawvideo -s 1920x1080 -r 30 -pix_fmt yuv420p -i output.yuv -c:v libx264 -preset ultrafast -qp 0 output.mp4
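
As for the first step, the forward conversion could look something like this (a sketch with placeholder file names):

import subprocess

# Convert the video to raw planar YUV so Audacity can import it as
# headerless "audio" data
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-f", "rawvideo",       # no container, just the raw frames
    "-pix_fmt", "yuv420p",  # planar YUV 4:2:0
    "output.yuv",
], check=True)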

I tried this on an old video of mine, and added some reverberation. Surprisingly, the file survived the transformation, and we can even see the impact of the reverb:

La Lune, a datamoshing example using Audacity

This actually is another type of datamoshing. It is not about dropping reference frames, but simply about applying weird effects to a video. And I thought that maybe there is a way to replicate the usual datamoshing effect without actually semi-corrupting the file.

Optical Flow Transfer

Trying other datamoshing techniques made me realize that I did not have to mistreat my files to get the effect I wanted. Actually, what I was mostly looking for in datamoshing was the fact that objects from the second clip would subtly appear through their motion over the first clip's pixels. Thinking about this, I had another idea, way simpler to achieve: transferring optical flow.

Optical flow is a motion field (i.e. defined at every point of space) representing the movement of pixels between the frames of a video. Basically, this is what we saw in the basketball video earlier in this article: video encoding uses motion vectors to represent differences between frames. In order to reproduce the effect I was aiming for, I simply had to take an image, and move its pixels according to the optical flow of a video.

If I ever take the time to dig into it, an efficient way of doing this could be to extract motion vectors directly from the encoded video file. But I rapidly found a library that already implements optical flow computation in Python, OpenCV. It implements the Farnebäck algorithm, which computes optical flow by representing the pixels of an image as quadratic polynomials, and estimates local motion from the difference between two consecutive polynomial expansions. After this, you get a 2D array indicating the movement of each pixel along the X and Y axes, which you can manually apply to a reference frame.
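
Here is a minimal sketch of the idea with OpenCV and NumPy (placeholder file names; the Farnebäck parameters and the backward-warping approximation are choices made for the example, not necessarily what my final script uses):

import cv2
import numpy as np

cap = cv2.VideoCapture("driving_clip.mp4")  # the clip providing the motion
ok, prev = cap.read()
height, width = prev.shape[:2]
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# The still image whose pixels will be moved by the clip's optical flow
canvas = cv2.resize(cv2.imread("reference.jpg"), (width, height)).astype(np.float32)

# Pixel coordinate grids, used to turn the flow field into a remapping
grid_x, grid_y = np.meshgrid(np.arange(width), np.arange(height))

writer = cv2.VideoWriter("moshed.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30, (width, height))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    # Pull each output pixel from where the flow says it came from
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    canvas = cv2.remap(canvas, map_x, map_y, cv2.INTER_LINEAR)
    writer.write(canvas.astype(np.uint8))
    prev_gray = gray
writer.release()
cap.release()

Remapping with the negated flow is only a rough approximation of pushing pixels forward, but it is good enough for this kind of effect.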

Example of an optical flow transfer. See the output only.

Now that this was working, I tried to create a little exhibition video. I chose an image I would reuse for each scene, and several clips where some movement is involved. Using the optical flow transfer technique, I created the datamoshed clips. For some of them, I also transferred the hue (as in the HSL color representation) of the video pixels to the image, allowing for better object recognition and more colorful results. To tie everything together, I used a track from the band Cosmopaark that has been reversed, extended and modified by adding reverb and other effects. The original video clips come from the movie Mercuriales (Virgil Vernier, 2014). Here is the result:

234₃₂K (Yohan Chalier, 2022)

Update (2022-08-10): I implemented another demo of this optical flow transfer technique for the web browser, using the webcam as a live video source. Check it out on my GitHub!

Update (2024-07-10): I updated the script that drops I-frames to fix the stuttering issue. Frames carry a timestamp, which leaves a gap when a frame is artificially removed. There is an FFmpeg flag to reset frame timestamps, which solves the issue. Quite some time later, but still interesting.