Datamoshing

Aug. 3, 2022

I once saw a very intriguing YouTube video: [Les peintures numériques de Jacques Perconte](https://www.youtube.com/watch?v=8_Xhu9Vx5XM), on [Tracks](https://www.youtube.com/c/TRACKSARTEFr), a show produced by the Franco-German television channel [Arte](https://www.arte.tv/fr/) and dedicated to emergent forms of art. In the video, the French artist Jacques Perconte exhibits his work, which relies on datamoshing. He encodes videos in such a way that he loses data without corrupting the file, resulting in mesmerizing video artifacts. He then plays with those artifacts to merge several pictures in evocative ways.
Extracted from Avant l'effondrement du Mont Blanc (2020)
I found that amazing, and wanted to try it myself. So first, I tried to replicate it.

# A First Attempt at Compressionism

To understand the principle, one must know a little bit about video encoding.

## A Quick Introduction to Video Encoding

Basically, a video is a sequence of still images, called frames. You can take a series of JPGs and concatenate them into a video file, for instance, using FFmpeg:

```console
ffmpeg -framerate 30 -i frame_%03d.jpg -c:v libx264 -pix_fmt yuv420p output.mp4
```

But storing each individual frame of a video is an insanely inefficient solution. For a one-minute HD video at 30 frames per second and 8-bit color depth, the file size would be more than 10 GB (assuming 1920 × 1080 pixels at 3 bytes each: 1920 × 1080 × 3 × 30 × 60 ≈ 11.2 GB). Instead, video files are compressed, using the fact that in a real video, the next frame remains close to the previous one. So, if you know one frame, you just have to store the difference with the next one, and re-compute it on the fly when playing, and then do the same for the next frame, and so on. When the scene changes, you can store a new reference frame, and compute differences from that new starting point. This is where [video compression picture types](https://en.wikipedia.org/wiki/Video_compression_picture_types) come into play.
There are three types of frames:

- An **I-Frame** is a reference frame, a full picture
- A **P-Frame** relies on previous frames for reconstruction
- A **B-Frame** relies on previous and next frames for reconstruction

You can list the picture types of a video using FFprobe and this command (it outputs JSON data):

```console
ffprobe -v quiet -pretty -print_format json -show_entries "format=size,bit_rate:frame=coded_picture_number,pkt_pts_time,pkt_pts,pkt_dts_time,pkt_dts,pkt_duration_time,pict_type,interlaced_frame,top_field_first,repeat_pict,width,height,sample_aspect_ratio,display_aspect_ratio,r_frame_rate,avg_frame_rate,time_base,pkt_size" -select_streams v:0 video.mp4
```

Of course, video encoding algorithms are a lot more complex, but this is a simple way of understanding their basic behavior. For instance, this explains why you cannot seek freely within a video file: the decoder needs to parse the nearby frames first to compute the one you are looking for. During this reconstruction, codecs partly rely on pixel motion within the frame:
Motion vectors of P and B-Frames of a basketball rolling on a white background (details in FFmpeg's documentation)
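To make the picture types more tangible, here is a minimal Python sketch that summarizes the JSON printed by the FFprobe command above. It assumes the output has a top-level `frames` list whose items carry a `pict_type` field; the `SAMPLE` string is a hand-written, hypothetical excerpt, not real output, and `summarize_picture_types` is a name made up for this illustration.

```python
import json
from collections import Counter

# Hypothetical, hand-written excerpt shaped like FFprobe's JSON output
# (assumption: a top-level "frames" list with a "pict_type" per frame).
SAMPLE = """
{"frames": [
  {"pict_type": "I"}, {"pict_type": "P"}, {"pict_type": "B"},
  {"pict_type": "P"}, {"pict_type": "I"}, {"pict_type": "P"}
]}
"""


def summarize_picture_types(ffprobe_json):
    """Count picture types and measure the distance between I-Frames."""
    frames = json.loads(ffprobe_json).get("frames", [])
    types = [frame.get("pict_type", "?") for frame in frames]
    counts = Counter(types)
    # Positions of the I-Frames, then the gap between consecutive ones
    i_positions = [i for i, t in enumerate(types) if t == "I"]
    gaps = [b - a for a, b in zip(i_positions, i_positions[1:])]
    return counts, gaps


counts, gaps = summarize_picture_types(SAMPLE)
print(dict(counts))  # how many I, P and B frames were listed
print(gaps)          # number of frames between consecutive I-Frames
```

On a real file, the gaps give a rough idea of the GOP length the encoder actually used.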
Again, you can see these motion vectors with FFmpeg:

```console
ffplay -flags2 +export_mvs video.mp4 -vf codecview=mv=pf+bf+bb
```

The whole point of *compressionism* is to take two clips, and decode the P and B frames of the second clip using the I-Frame of the first clip as reference. The image remains similar to the first clip, but you start to see the artifacts from the second one. Typically, the motion vectors of the second clip are applied to the pixels of the first one. This is really clever! Here is a good read if you want to know more about this: [*Back to basics: GOPs explained*, by Bryan Samis](https://aws.amazon.com/blogs/media/part-1-back-to-basics-gops-explained/).

## Industrial Espionage

In Tracks' video, at [2:51](https://www.youtube.com/watch?v=8_Xhu9Vx5XM&t=171s), the camera shows Jacques Perconte's screen with a Ruby script. I've transcribed it on [GitHub](https://github.com/ychalier/datamoshing/blob/main/perconte.rb). Basically, it prepares arguments for an `ffmpeg` command. Here are some of those arguments explained, according to the [FFmpeg documentation](https://sites.google.com/site/linuxencoding/x264-ffmpeg-mapping):

Argument | Value | Description
-------- | ----- | -----------
`-vf scale` | `5760:3240` | set the output size
`-vf flags` | `neighbor` | use nearest-neighbor algorithm for scaling
`-c:v` | `libxvid` | use the Xvid video encoder
`-b:v` | `211277k` | set the video bitrate (good quality)
`-sc_threshold` | `0` | scenecut, the threshold for scene change detection (default: 40)
`-g` | `20` | keyframe interval, also known as GOP (group of pictures) length: the maximum distance between I-frames (default: 250)
`-me_method` | `zero` | motion estimation method (`zero` means no motion estimation)
`-c:a` | `libmp3lame` | use the LAME MP3 audio encoder
`-b:a` | `256k` | set the audio bitrate (good quality)

It seems that the idea is to artificially increase the frequency of I-Frames (by decreasing the scenecut threshold and the GOP length).
I guess this is in preparation for a second step (which is not shown in Tracks' video): dropping I-Frames. Making I-Frames frequent will increase the glitchy look of the video once they are removed. Yet, we'll see later that such settings might not be enough for our needs.

## Dropping Reference Frames

The plan is simple:

1. Take a video file
2. Split it into its frames
3. Identify their types
4. Delete I-Frames
5. Rebuild the video file

And voilà! Hum, actually, one does not simply drop I-Frames. Current implementations of encoders are the result of decades of advanced research. So get ready for a dive into technical implementation details.

I chose to work with the h264 encoder, without any specific reason, but I had to start somewhere. I searched online for what was inside an MP4 file. It is a container format, made of sections, each [identified by a 4-byte word](https://docs.fileformat.com/video/mp4/). Sections starting with `mdat` contain actual data. For an h264 video, this section contains an h264 stream. This stream is a sequence of [Network Abstraction Layer (NAL) Units](https://en.wikipedia.org/wiki/Network_Abstraction_Layer). Each unit contains either metadata (such as encoding parameters) or frame slices. Units start with a header, which we can parse to determine the unit's purpose. I found a very handy guide for this process: [*Introduction to H.264: (1) NAL Unit*, by Yumi Chan](https://yumichan.net/video-processing/video-compression/introduction-to-h264-nal-unit/).

The header contains a `nal_unit_type` field, encoded as a 5-bit integer. Reference frames are of type 5. Interpolated frames are of type 1. The rest isn't very interesting for us. Note that here, I do not mention I, P or B frames anymore. The documentation mentions "IDR" or "non-IDR" pictures, which stands for "(non-) Instantaneous Decoding Refresh" pictures. I was neither willing nor able to dig deeper. My plan was simply to drop every IDR frame but the first, and rebuild the h264 stream.
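The plan above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the repository: it assumes an Annex B byte stream, where NAL units are separated by `00 00 01` start codes, whereas inside an MP4 container units are typically length-prefixed instead. The function names are made up for this example, and the sample stream is synthetic (header bytes followed by dummy payload).

```python
START_CODE = b"\x00\x00\x01"


def split_nal_units(stream: bytes) -> list:
    """Split an Annex B byte stream into NAL units (start codes removed)."""
    positions = []
    index = stream.find(START_CODE)
    while index != -1:
        positions.append(index)
        index = stream.find(START_CODE, index + len(START_CODE))
    units = []
    for n, position in enumerate(positions):
        begin = position + len(START_CODE)
        end = positions[n + 1] if n + 1 < len(positions) else len(stream)
        # A NAL unit never ends with a zero byte, so trailing zeros belong to
        # the next 4-byte start code (or to padding) and can be dropped.
        unit = stream[begin:end].rstrip(b"\x00")
        if unit:
            units.append(unit)
    return units


def nal_unit_type(unit: bytes) -> int:
    """Return the 5-bit nal_unit_type stored in the low bits of the header."""
    return unit[0] & 0x1F


def drop_later_idr(stream: bytes) -> bytes:
    """Keep the first IDR picture, drop every IDR slice (type 5) after it."""
    out = bytearray()
    seen_non_idr = False
    for unit in split_nal_units(stream):
        kind = nal_unit_type(unit)
        if kind == 1:
            seen_non_idr = True  # the first IDR picture is now behind us
        if kind == 5 and seen_non_idr:
            continue
        out += b"\x00" + START_CODE + unit
    return bytes(out)


# Synthetic stream: SPS (7), PPS (8), IDR (5), two non-IDR (1), IDR, non-IDR
units = [b"\x67\x42", b"\x68\xce", b"\x65\x88", b"\x41\x9a",
         b"\x41\x9b", b"\x65\x89", b"\x41\x9c"]
stream = b"".join(b"\x00\x00\x00\x01" + unit for unit in units)
moshed = drop_later_idr(stream)
print([nal_unit_type(unit) for unit in split_nal_units(moshed)])
```

The sketch only removes units; it does not touch timing or frame numbering, so it should not be expected to produce a clean stream on its own.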
So, I implemented that. It's available on [GitHub](https://github.com/ychalier/datamoshing), if you want to use or improve it. Here is how it looks:
Sunrise dive, an example of two concatenated clips with I-Frames dropped using my I-Frame dropping algorithm
The result is ok-ish. The moshing effect does work as intended. But you may notice that every once in a while, the video stutters (or jumps). This happens because dropping an IDR frame messes up the timings. Apparently, dropping one frame without updating the others creates some issues. It's like I have to recompute the timing of the next frames, or something like that. Also, I had a lot of issues with B-Frames being counted as reference frames, which would make the stuttering even more violent. B-Frames can be removed using FFmpeg with the `-bf 0` argument. The spacing of IDR frames can then be controlled with the `-g` (maximum spacing) and `-keyint_min` (minimum spacing) arguments.

So here I was: I had a working-ish algorithm, but I wanted something a little cleaner.

# Exploring Existing Solutions

So I did what I should have done at the very beginning: look for existing solutions online. I mostly looked at one subreddit, [r/datamoshing](https://www.reddit.com/r/datamoshing).

## Avidemux

[Avidemux](https://github.com/mean00/avidemux2) is an open-source video editor for Linux, Windows and Mac OS X. And weirdly enough, it is mentioned a lot as a solution for datamoshing. The process is similar to the one I used:

1. Encode a video with specific settings
2. Drop I-Frames
3. Export the result

What I do not understand, though, are the specific instructions one must follow to do that. First, you'll need an old version of Avidemux, v2.5.6 (which is not even listed on the [release page](https://github.com/mean00/avidemux2/releases)!). Then,

1. Load a video into Avidemux. If it asks about using a safe mode, say "No"; if it asks about rebuilding frames, say "No"
2. Under the video export options tab, select "MPEG-4 ASP (Xvid)"
3. Go to "Configure" and then "Frames", and set "Maximum I-frame Interval" to a large number, and "Maximum Consecutive B-frames" to 0
4. Export the video as an AVI file
5. Load the exported video
6. Manually delete every I-Frame
7. Export the video a second time

While I can guess why we need to set maximum consecutive B-frames to 0, I do not understand the rest of the settings. Why Xvid? Jacques Perconte also used it; is there a reason for this? And why set the GOP length to a large number? That differs from Jacques Perconte's settings. I might be misunderstanding some things though, so anyway, let's try it. Here is an example where you can really see how the moshing allows for smooth blending and transitions:
Datamoshing example using Avidemux
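For readers who would rather stay with FFmpeg, the encoding settings of the Avidemux recipe (Xvid, no B-Frames, a long I-Frame interval) can probably be approximated with the `-c:v`, `-bf` and `-g` arguments mentioned earlier. The Python sketch below only builds the argument list; the file names and the GOP value of 600 are hypothetical, and whether the resulting stream behaves like Avidemux's export is an untested assumption.

```python
import shlex


def build_mosh_prep_command(src, dst, gop=600, codec="libxvid"):
    """Return an FFmpeg argument list: no B-Frames, long I-Frame interval."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", codec,    # Xvid, as in the Avidemux recipe
        "-bf", "0",       # no B-Frames
        "-g", str(gop),   # maximum distance between I-Frames
        dst,
    ]


command = build_mosh_prep_command("input.mp4", "prepared.avi")
print(shlex.join(command))
# To actually run it (requires FFmpeg on the PATH):
# import subprocess; subprocess.run(command, check=True)
```

Deleting the I-Frames would still have to happen afterwards, in Avidemux or with a custom script.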
On a side note: here is a small [AutoHotkey](https://www.autohotkey.com/) script for executing the 6th step automatically:

```ahk
^f::
Loop 10
{
    ; Jump to the next keyframe and mark the start of the selection
    Send,{Up}
    Sleep, 50
    Send,[
    Sleep, 50
    ; Step one frame forward and mark the end of the selection
    Send,{Right}
    Sleep, 50
    Send,]
    Sleep, 50
    ; Delete the selected frame
    Send,{Delete}
    Sleep, 100
}
return
```

It deletes the next 10 I-Frames when Ctrl+F is pressed.

## MoshUp

Another really popular piece of software I heard about is the Android application [MoshUp](https://play.google.com/store/apps/details?id=com.pytebyte.moshup). Top posts on [r/datamoshing](https://www.reddit.com/r/datamoshing) were made using this app. And it is very easy to use: you just have to record a video with your phone, pause the recording, and resume it. When paused, you can see an overlay of the current camera view on top of the last frame, allowing you to precisely align the next clip. The cut after resuming is when the moshing happens, probably by simply dropping the first I-Frame of the resumed clip.
Grabbing a vinyl (left) and Library (right), datamoshing examples using MoshUp
Most examples on Reddit show somebody grabbing an item, which gets replaced by another one with a similar shape, sometimes with scaling effects. So I tried to do the same.

## Audacity

Alright, yes. You can datamosh videos using [Audacity](https://www.audacityteam.org/). Audacity lets you import raw binary files, normally audio files, but you certainly can import anything! If you try to listen to it, it will just sound like a very artsy piece. Once imported, you can apply audio effects on the video, and export it back. I found [this tutorial](https://www.reddit.com/r/glitch_art/comments/if4bh3/tutorial_databending_video_files_with_audacity/) on the r/glitch_art subreddit. Basically, import, apply effects, and export:

1. Convert your video into [YUV](https://wiki.videolan.org/YUV) format (`.yuv`) using FFmpeg; this creates an uncompressed file where pixels are stored in a flat manner, thus enabling some signal processing
2. In Audacity, go to "File", "Import", "Raw Data", and choose parameters "A-Law" and "Little Endian"
3. Apply audio effects
4. Go to "File", "Export Audio", and choose "Save as Type: other uncompressed files", "Header: RAW (header-less)" and "Encoding: A-Law"
5. Convert the file back using FFmpeg (be sure to change the resolution and the framerate):

   ```console
   ffmpeg -f rawvideo -vcodec rawvideo -s 1920x1080 -r 30 -pix_fmt yuv420p -i output.yuv -c:v libx264 -preset ultrafast -qp 0 output.mp4
   ```

I tried this on an old video of mine, and added some reverberation. Surprisingly, the file survived the transformation, and we can even see the impact of the reverb:
La Lune, a datamoshing example using Audacity
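To give an idea of what an audio effect does to raw video bytes, here is a minimal Python sketch of a feed-forward echo. It is a deliberately simplified stand-in, not what Audacity does: it skips the A-Law decoding and reads each byte as an unsigned 8-bit sample centered on 128. The function name `add_echo` and its parameters are made up for this illustration.

```python
def add_echo(raw: bytes, delay: int, decay: float = 0.5) -> bytes:
    """Mix each byte with an attenuated copy of the byte `delay` positions earlier."""
    if delay <= 0:
        raise ValueError("delay must be a positive number of bytes")
    out = bytearray(raw)
    for i in range(delay, len(raw)):
        # Treat bytes as samples centered on 128, like 8-bit unsigned audio
        current = raw[i] - 128
        delayed = raw[i - delay] - 128
        mixed = current + decay * delayed
        # Clamp back to the valid byte range
        out[i] = max(0, min(255, int(round(mixed)) + 128))
    return bytes(out)


# A bright sample at index 1 leaks into index 3 with a delay of 2 bytes
print(list(add_echo(bytes([128, 228, 128, 128]), delay=2)))
```

The output keeps the same length as the input, which is what allows the file to be read back as raw video. In a `yuv420p` file, a delay equal to the frame width should pull in the pixel one row above within the luma plane, while larger delays smear content across rows or frames.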
This is actually another type of datamoshing. It is not about dropping reference frames, but simply about applying weird effects to a video. And I thought that maybe, there is a way to replicate the usual datamoshing effect without actually semi-corrupting the file.

# Optical Flow Transfer

Trying other datamoshing techniques made me realize that I did not have to mistreat my files to come up with the effect I wanted. Actually, what I was mostly looking for in datamoshing was the fact that objects from the second clip would subtly appear through their motion over the first clip's pixels. Thinking about this, I had another idea, way simpler to achieve: transferring optical flow.

[Optical flow](https://en.wikipedia.org/wiki/Optical_flow) is a motion field (i.e. defined at every point of space) representing the movement of the pixels between frames of a video. Basically, this is what we saw in the basketball video earlier in this article: video encoding uses motion vectors for representing differences between frames. In order to reproduce the effect I was aiming for, I simply had to take an image, and move its pixels according to the optical flow of a video.

If I ever take the time to dig into it, an efficient way of doing this could be to extract motion vectors directly from the encoded video file. But I rapidly found a library already implementing optical flow computation in Python, [OpenCV](https://opencv.org/). It implements the [Farnebäck algorithm](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102.2455&rep=rep1&type=pdf), which computes optical flow by approximating the neighborhood of each pixel with a quadratic polynomial, and estimates local motion by comparing the polynomials of two consecutive frames. After this, you get a 2D array indicating the movement of each pixel along the X and Y coordinates, which you can manually apply to a reference frame.
Example of an optical flow transfer. See the output only.
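The step of applying a flow field to a reference frame can be sketched in pure Python. This is a minimal illustration, assuming backward mapping (each output pixel is fetched from where the flow says it came from) and nearest-pixel rounding; it is not necessarily how the clips here were produced. In practice the flow would come from something like OpenCV's `cv2.calcOpticalFlowFarneback`, while here it is a hand-written, hypothetical field, and `warp` is a name made up for this example.

```python
def warp(image, flow):
    """Move the pixels of `image` along `flow` using backward mapping.

    image: list of rows of pixel values
    flow:  list of rows of (dx, dy) displacements, same shape as image
    """
    height, width = len(image), len(image[0])
    warped = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Fetch the pixel from where it was before the motion,
            # clamping coordinates to stay inside the image
            src_x = min(max(int(round(x - dx)), 0), width - 1)
            src_y = min(max(int(round(y - dy)), 0), height - 1)
            warped[y][x] = image[src_y][src_x]
    return warped


# A one-row image whose pixels all move one step to the right
image = [[1, 2, 3, 4]]
flow = [[(1, 0)] * 4]
once = warp(image, flow)
twice = warp(once, flow)  # feeding the output back in accumulates the motion
print(once, twice)
```

Feeding each warped result back in for the next pair of video frames is what makes the motion accumulate on the still image over time.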
Now that this was working, I tried to create a little exhibition video. I chose an image I would reuse for each scene, and several clips where some movement is involved. Using the optical flow transfer technique, I created the datamoshed clips. For some of them, I also transferred the hue (as in the [HSL](https://en.wikipedia.org/wiki/HSL_and_HSV) color representation) of the video pixels to the image, allowing for better object recognition and more colorful results. To tie everything together, I used a track from the band [Cosmopaark](https://cosmopaark.bandcamp.com/album/sunflower) that has been reversed, extended and modified by adding reverb and other effects. The original video clips come from the movie [*Mercuriales* (Virgil Vernier, 2014)](https://www.imdb.com/title/tt3454612/). Here is the result:
234₃₂K (Yohan Chalier, 2022)
**Update (2022-08-10):** I implemented another demo of this optical flow transfer technique for the web browser, using the webcam as a live video source. [Check it out on my GitHub](https://ychalier.github.io/datamoshing/optical-flow-webcam/)!