Technological and Computer-based Projects



Auto Video: Automatic Video Editing Driven Algorithmically By Parameters of Audio
Version 1.0 complete
Using: Jitter, Max/MSP


Presented at the Hyde Park Salon, Dec 5 2009.

This project was conceived as a way to make short films that track perfectly the contours of a musical recording. Inspired vaguely by the concept of a Fourier transform, in which a signal is described as a sum of simple sinusoidal components, this is video artwork defined by applying a complex algorithm to an audio signal.

Pre-existing video is loaded into a Jitter/Max/MSP patch, and various aspects of the sound are used to drive decisions about which video to display, when to cut, when and how to use special video effects, and so on. The creativity involved in making these videos, then, comes not from working directly with any one final product, but rather from sculpting the algorithm.

As I have worked, the project has evolved into a tool allowing nontechnical users to take, for example, home movies and favorite songs and cut them together in an aesthetically-satisfying way that goes far beyond the "soundtrack" or "slideshow" options available on current consumer (or even professional) software.

The example shown here was shot on Super 8mm film in January 2009 in Bath, UK. The audio is "Sorely Missed" by the band The Cedar.

Summary— Project stages:
  1. Version 1.0: Video automatically driven from several clips to express the contours of the chosen audio file. Potential users would have to give me the files, and let me set up the session and tweak parameters before a strong product emerged.
  2. Version 2.0: Users are presented with a simple GUI to load/upload video and audio files, and simple, straightforwardly-labeled sliders to adjust pertinent parameters in the algorithms within prescribed bounds. Some aspects of the video and audio files are read in automatically from analysis, but the user needs to spend some time "playing" to get a product that expresses the feel of the song, and is varied enough but still cohesive. In Version 2, this process could take place without guidance from someone trained in the software, allowing the algorithm to be shared as a simple runtime download.
  3. Version 3.0: Users upload files at a public terminal, and all analysis happens automatically, or with a bare minimum of adjustments. The software automatically finds points of change in the audio and video and uses them to create a short film that tracks the audio. The algorithm has an "editor personality" similar to a human editor, but the same video clips used for different songs will create very different videos, and the same song can power many combinations of video clips into films not recognizable as having been created by algorithm, rather than by human decision.
  4. Version 4.0: All of this ported to [potentially jMax] Java environment, so the entire process can take place on the Web. Users can simply paste in links of the YouTube videos they wish to use, and upload an mp3 for their audio. Processing happens on the server running the jMax port of Auto Video, and users are presented with the final product to view, download, or transfer to their own YouTube user account.

Project Description

I envision a website where musicians, filmmakers, and interested laypersons can choose three YouTube links[1] (their own work, or anything they enjoy and want to mash up) and upload an mp3. Then, with some automated analysis[2], they are presented with a small-ish number of simple continuous control sliders and a few on/off toggles that are labeled descriptively, not technically. By manipulating these, they control the way the video output is processed, cut, and re-ordered by analysis of parameters of their new movie's soundtrack. When they are happy with the final product, they simply click "Render" and in a few moments are presented with a new YouTube link[3] displaying an HD version of the work they have just co-created.

I imagine this being used by families wanting to create a simple, yet compelling overview of a collection of home movies; perhaps baby's first steps or the family's interview with Grandma on her 85th birthday, set to a favorite song.

I also can see it being used by independent musicians and bands who have recorded exciting new material and have shot footage with a quality video camera, but whose artistic expertise does not extend to the world of video editing.

Lastly, I can see it being used as a tool by professional and student video editors, as a sort of "intuition pump" the way, say, Brian Eno's Oblique Strategies (as cards or as an iPhone app) might be used by sound designers and composers. Perhaps these users wouldn't use the final output directly, but would instead use its parameters as inspiration for their own manual creations. That said, perhaps this algorithmic editor's "personality" would be persuasive enough, its voice strong enough, that it would actually see its style copied in human-edited work.

Realization in Max/MSP/Jitter

The Max/MSP/Jitter patch is very complex, yet there is still much that might be added. In overview: when the patch opens, it automatically loads an audio track into a buffer object and up to three video tracks into separate jit.qt.movie objects, designated as the A, B, and C sections of the piece. A simple start/stop patch starts audio and video together on a press of the space bar, and stops both (and resets the patch) on a press while playing. To start the various activities throughout the patch, a bang is sent to a "send start" object, so that any process needing to be banged on start can simply receive the message and act accordingly.

The major subpatchers are ReadPeaks, FramePicker, and DimLy. ReadPeaks is the audio-analysis patcher; it sends calibrated data to FramePicker, which then jumps to an appropriate frame in the current video clip based on the current analysis parameters. DimLy implements an amplitude-controlled video downsampling routine that is active for about 5% of the running time of the patch. Softer sections are more likely to be pixelated, and the pixelation grows more intense the quieter the source audio becomes. (In the demo video, the pixelation comes in during the first verse, only when Neil isn't singing; there is an extreme example of pixel downsampling at the top of the song.) Several other, unnamed subpatchers exist whose purpose is more abstract.
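The amplitude-to-pixelation behavior described for DimLy can be illustrated outside of Max with a short Python sketch. This is not the patch's actual logic, which lives in Max/MSP/Jitter objects: the function names, the `quiet_threshold` and `max_block` parameters, and the linear mapping are all assumptions made for illustration, and the roughly-5% activity gating is not modeled.

```python
def block_size(amplitude, quiet_threshold=0.2, max_block=16):
    """Map a normalized amplitude (0.0-1.0) to a pixel block size.

    Hypothetical mapping: at or above quiet_threshold the frame is left
    untouched (block size 1); below it, the block grows linearly as the
    audio gets quieter, reaching max_block at silence.
    """
    amplitude = min(max(amplitude, 0.0), 1.0)
    if amplitude >= quiet_threshold:
        return 1
    quietness = 1.0 - amplitude / quiet_threshold  # 0.0 at threshold, 1.0 at silence
    return 1 + round(quietness * (max_block - 1))


def pixelate(frame, block):
    """Downsample a frame (a list of rows of pixel values) by repeating the
    top-left pixel of every block x block tile across the whole tile."""
    if block <= 1:
        return [row[:] for row in frame]
    return [
        [frame[(y // block) * block][(x // block) * block] for x in range(len(row))]
        for y, row in enumerate(frame)
    ]
```

Calling `pixelate(frame, block_size(level))` once per video frame would leave loud passages untouched and make quiet ones progressively blockier.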

Current Capabilities

The current version of the patch produces, in real time, highly responsive video whose edits match specific parameters of the audio more closely than a human editor could manage, even without the constraint of operating in real time. Because this patch contains no tempo or beat followers, it will react instantly to changes in tempo, never be confused by a change of groove or a dropped beat, and will work just as well with nonpercussive input, such as a string quartet, as it will with the static-tempo, electronically-produced, beat-heavy music for which beat-following works so well.
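One way to picture editing that follows the audio itself rather than an inferred beat is to propose a cut whenever the level rises sharply from one video frame to the next. The Python sketch below is a simplified stand-in, not the analysis ReadPeaks performs: the function names, the 24 fps default, and the `jump` threshold of 0.3 are assumptions chosen for illustration.

```python
def frame_peaks(samples, sample_rate, fps=24):
    """Return the peak absolute amplitude within each video-frame-sized window."""
    window = max(1, sample_rate // fps)
    return [
        max(abs(s) for s in samples[i:i + window])
        for i in range(0, len(samples), window)
    ]


def cut_points(peaks, jump=0.3):
    """Return the frame indices where the peak rises by more than `jump`
    relative to the previous frame (a crude onset detector)."""
    return [
        i for i in range(1, len(peaks))
        if peaks[i] - peaks[i - 1] > jump
    ]
```

Because no tempo model is kept between frames, a tempo change or a dropped beat simply changes the spacing of the detected jumps; there is no internal clock to fall out of step.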

Current Limitations

While the current version of this patch gives acceptable results, it does not do so quickly or automatically. Prominently, the A, B and C sections of the piece must be hand-entered and are not auto-detected. Additionally, while the algorithms as currently written work well with a variety of music, they don't adapt to variance between pieces of music and therefore are not sensitive to small changes and are also apt to fail when given music that is very different from the norm.

The scope of effects applied to the video is very small; more options should be added to keep the algorithm from producing "the same video" for each song. (It should be noted that even "the same video," i.e. the same algorithmic settings, with different audio and video inputs, would produce a very different result. That said, a wider range of artistic choices would give the patch's "personality" more depth.)

Relatedly, the number of sections is currently small. In the demonstration song[4] there is a horn section, a post-chorus, and a full-band instrumental that go unrecognized by the editing of the video. More effects would allow targeted variety in the final product, without requiring more clips as input. Automatic detection of sections could address this problem concurrently.

Next Steps

Integrating, for example, Eronen's[5] and Goto's[6] work on chorus detection, as well as Tristan Jehan's wide-ranging and powerful suite of Max externals, would go a long way toward realizing a patch that can choose its own A, B, and C sections. At that point, a pre-scan ability will be programmed into the patch so that it may automatically read maximums and minimums, means and standard deviations, and use these to set bounds on the parameters it is manipulating to create video edits. One filter to be added to the list of options is a simple color-tint swell based on pitch center (attempted in this version, but with unsatisfactory results), where a given bass note or chord would gradually swell a certain color into the scene; a change in the bass note would reset the colors and start a new swell.
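The pre-scan idea can be sketched in a few lines of Python: summarize an analysis parameter over the whole track, then use those statistics to rescale live values into a fixed 0.0-1.0 range. This is only an illustration under stated assumptions. The function names, the dictionary layout, and the choice of bounds (the mean plus or minus `spread` standard deviations, clipped to the observed extremes) are hypothetical, not taken from the patch.

```python
import statistics


def prescan(values):
    """Summarize one analysis parameter over the whole track."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }


def normalize(value, stats, spread=2.0):
    """Rescale a raw value to 0.0-1.0 using bounds of mean +/- spread * stdev,
    clipped to the observed minimum and maximum."""
    low = max(stats["min"], stats["mean"] - spread * stats["stdev"])
    high = min(stats["max"], stats["mean"] + spread * stats["stdev"])
    if high <= low:
        return 0.5  # flat signal: no usable range, so sit in the middle
    return min(max((value - low) / (high - low), 0.0), 1.0)
```

With bounds set per track this way, a quiet recording and a heavily compressed one would both exercise the full range of each video parameter.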

Relatively easy to implement, but inappropriate for the current demo song, is a "thump" detector that rotates the picture a few degrees if a low, kick-drum-like attack is detected. The amount of rotation would be proportional to the volume in that frequency band, giving a visual sense of how hard the aural low end is being hit. For loud, aggressive music this could be stylistically appropriate and add significantly to the effect made by the sort of footage that would likely be chosen to match songs in genres such as industrial, punk, and metal.

Another video filter would be chromakeying, where two clips play simultaneously and a certain color range in one clip is made transparent. This "green screen"-like effect ranges from breathtaking to corny, so it might be best left as a user-selectable option. Perhaps only a human can decide (or disagree with another human, or for that matter a robot-editor) where on the scale a specific combination of clips and audio ends up.
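The "thump" detector described above has not been built, so the following Python sketch is only one possible reading of it. It approximates "volume in the low band" with a one-pole low-pass filter and maps the resulting level to a rotation angle. The 120 Hz cutoff, the 0.3 detection threshold, the 5-degree maximum, and both function names are assumptions for illustration.

```python
import math


def low_band_level(samples, sample_rate, cutoff_hz=120.0):
    """Return the peak level of the signal after a one-pole low-pass filter,
    a cheap approximation of the energy in the kick-drum range."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    state = 0.0
    peak = 0.0
    for s in samples:
        state += alpha * (s - state)  # smooth toward the input sample
        peak = max(peak, abs(state))
    return peak


def thump_rotation(level, threshold=0.3, max_degrees=5.0):
    """Return a rotation in degrees proportional to the low-band level,
    or 0.0 when the level is below the detection threshold."""
    if level < threshold:
        return 0.0
    return min(level, 1.0) * max_degrees
```

Fed one video frame's worth of audio at a time, a loud kick would tilt the picture by a few degrees while hi-hats and vocals, which carry little energy below the cutoff, would leave it level.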

From Local to Global

As this project is developed, it will be tested with a growing variety of music and video. Concurrently, I will be looking into ways to run this as a self-service application accessed via the Web. jMax (a fledgling port of Max to the Java environment) is, sadly, defunct, but it would be possible to write a simple web GUI that relays commands to a copy of this patch running locally on the server, and that uploads video to YouTube or to the user's browser cache for previewing as settings are tweaked.






1. Live video, of course, is a possibility, although as we shall see, it would necessarily deliver a much different final product. I am leaving this enticing option for the time being, pending further development of this idea with pre-existing video.
2. Similar to the previous note, it would be relatively easy to adapt this technology to live music, even compared to live video. I hope to try this out at one of my own performances, once the algorithms are in place; audiences might find the dynamic crosscutting and processing even more compelling when it is obviously all being done in real-time, rather than being possibly a painstaking manual edit to correspond to a recorded track.
3. Vimeo, Hulu, or a download, depending on copyright issues and so on.
4. "Sorely Missed" by The Cedar, from their 2008 album "I'm always explaining to mom how it is different here". Used with permission.
5. Antti Eronen, Chorus Detection With Combined Use Of MFCC And Chroma Features And Image Processing Filters. At http://dafx.labri.fr/main/papers/p229.pdf
6. Masataka Goto: http://staff.aist.go.jp/m.goto/