Binary classification of MIDI artefacts as “music” or “not music”

(c) Ed Earl 2019; all rights reserved.

Generating music in a symbolic form such as MIDI format or something broadly equivalent is attractive for many reasons: it adequately describes a wide range of genres of Western tonal music, including Western art music, rock, pop and (tonal) jazz; it converts easily to sheet music; MIDI is a well-established data standard, supported by plenty of existing libraries and software; and the MIDI data specification is expressive, yet concise enough for flexible computation.

Taking the MIDI format as the domain of investigation (the possibility space), it’s easy to start creating artefacts. For example, specify a length of three minutes, eight channels (or parts, or voices) and a tempo of 120 bpm; at each quarter-note time step, randomly assign each part either a random MIDI note (0 to 127), a tie of the previous note, or a rest. Every artefact generated by this simple algorithm is overwhelmingly likely to be unpleasant to listen to. Moreover, I assert that casual listeners will perceive such artefacts as not music.
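
The random generator described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the names `generate_random_artefact`, `REST` and `TIE` are hypothetical, the three outcomes are assumed to be equally likely, and the result is a plain grid of events rather than an actual MIDI file.

```python
import random

REST = "rest"  # hypothetical marker: the part is silent for this step
TIE = "tie"    # hypothetical marker: the previous event is held over


def generate_random_artefact(minutes=3, parts=8, bpm=120, seed=None):
    """Return one list of events per part, one event per quarter note."""
    rng = random.Random(seed)
    steps = minutes * bpm  # one quarter note per beat, so 3 * 120 = 360 steps
    grid = []
    for _ in range(parts):
        events = []
        for _ in range(steps):
            choice = rng.choice(("note", "tie", "rest"))  # assumed equal odds
            if choice == "note":
                events.append(rng.randint(0, 127))  # any MIDI note number
            elif choice == "tie":
                # A tie on the very first step has nothing to hold; a renderer
                # would have to treat it as a rest.
                events.append(TIE)
            else:
                events.append(REST)
        grid.append(events)
    return grid


artefact = generate_random_artefact(seed=1)
print(len(artefact), len(artefact[0]))  # 8 parts, 360 steps each
```

Writing the grid out as a MIDI file would need a MIDI library, which is left out here to keep the sketch self-contained.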

We can partition the MIDI space by the likely judgement of typical listeners into non-musical artefacts, “musical” (though not necessarily “pleasant” or “good”) artefacts, and aesthetically “good” musical artefacts. We might start to approach the question of how to recognize “good” music by first considering how to recognize passable music: how to judge whether a given MIDI artefact will pass as “musical” to typical listeners. (Here I’m using the word “musical” to refer to tonal music in the broad Western tradition.)

Although people’s everyday appreciation of music is largely unconscious and intuitive, I believe it should be possible to identify and implement musical rules (or properties, or guidelines, or constraints) sufficient for a MIDI artefact to be perceived as musical. More work is needed, but the goal is achievable.

Musical rules

Clearly not all of these rules apply to all Western tonal musical works (and my present purpose isn’t to make prescriptions or explore questions of legitimacy), but my intuition is that most of the rules will apply most of the time; they loosely characterise Western tonal music. These rules are essentially those which are taught to human students of composition, and it makes intuitive sense that the rules which would make artefacts sound like music to typical listeners would be the rules which human composers actually study and follow. Here are some loose descriptions.

Voices or parts

Music is composed of voices or parts. The exact number of voices may change over the course of a piece, but the number of independent, melodically coherent voices or parts at any given time is on the order of a handful. Voices are essentially monophonic, and may be supported by “extra notes” in the form of doubling or the addition of (usually chordal) tones. (These “extra notes” aren’t subject to constraints on melodic coherence.)

Melodic coherence

Each voice plays within a typical register. Examined horizontally, voices proceed mostly by step, sometimes by skip. Wide skips are avoided. Chord tones are mostly approached by skip (or notes approached by skip are mostly perceived as chord tones). Notes are divided into phrases of about the length of a breath, and phrases are separated by pauses.
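
As a rough illustration of how the “mostly by step, sometimes by skip” guideline might be measured, the sketch below classifies the intervals between consecutive notes of a single voice. The function name and the thresholds (a step is at most 2 semitones, a wide skip is 8 semitones or more) are assumptions made for this illustration, not established definitions.

```python
def interval_profile(pitches, step_max=2, wide_min=8):
    """Return the fraction of repeats, steps, skips and wide skips in a voice.

    pitches: MIDI note numbers of one monophonic voice, in order.
    step_max, wide_min: assumed thresholds in semitones.
    """
    counts = {"repeat": 0, "step": 0, "skip": 0, "wide": 0}
    for prev, cur in zip(pitches, pitches[1:]):
        size = abs(cur - prev)  # interval size in semitones
        if size == 0:
            counts["repeat"] += 1
        elif size <= step_max:
            counts["step"] += 1
        elif size < wide_min:
            counts["skip"] += 1
        else:
            counts["wide"] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# A simple figure in C major: C D E C C D E C
profile = interval_profile([60, 62, 64, 60, 60, 62, 64, 60])
print(profile)
```

A classifier could compare such a profile against ranges observed in existing music, although which ranges count as “musical” is exactly the kind of question that remains open.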

Phrases can be understood as being composed of a number of sub-phrase elements, each a few notes long. These sub-phrase elements are used throughout the musical work, with repetition and variation; distinct phrases are built from a common set of sub-phrase elements.

Harmony


At any given time, a single chord and a single scale are usually recognisable. Harmonic rhythm generally proceeds on a slower timeframe than note rhythm. Chord progressions follow their own rules of horizontal coherence, discussed extensively in musical literature; the rules vary by musical genre (and partially define genre), although there are some common fundamentals, such as the power of V-I cadences to establish key.

Examined vertically, not only do the notes played by all voices comply or agree with the current chord, they “express” or “project” the chord: enough voices are playing chord tones, and enough of the tones comprising the chord are played by voices, for the chord to be recognisable and unambiguous.

The term “examined vertically” is an oversimplification. In music, it’s rare for all the tones of a chord to be sounded simultaneously; however, they will usually all be sounded at some point over the course of the several beats or bars during which the chord is held. The root of the chord is usually played in a bass voice, and usually on the first beat at which the chord appears; the root note usually occurs most frequently, followed by the fifth and then the third.
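
One possible reading of “projecting” a chord is sketched below: over the span during which a chord is held, check that every chord tone is sounded at some point and that most of the sounded notes are chord tones. The function name and the 0.7 threshold are hypothetical; nothing above specifies how many chord tones are “enough”.

```python
def projects_chord(sounded_notes, chord_pitch_classes, min_chord_ratio=0.7):
    """Crude test of whether notes sounded over a span project a chord.

    sounded_notes: MIDI note numbers from all voices across the span.
    chord_pitch_classes: pitch classes (0 = C ... 11 = B) of the chord.
    min_chord_ratio: assumed minimum share of notes that are chord tones.
    """
    if not sounded_notes:
        return False
    chord = set(chord_pitch_classes)
    sounded_classes = [n % 12 for n in sounded_notes]
    # Every chord tone should appear somewhere in the span.
    covered = chord.issubset(sounded_classes)
    # Most of what is played should belong to the chord.
    in_chord = sum(1 for pc in sounded_classes if pc in chord)
    return covered and in_chord / len(sounded_classes) >= min_chord_ratio


c_major = {0, 4, 7}  # C, E, G
print(projects_chord([48, 55, 60, 64, 62, 67], c_major))  # one passing tone (D)
print(projects_chord([48, 50, 53, 57, 59], c_major))      # E and G never sound
```

This ignores the observations about the bass voice and the relative frequency of root, fifth and third; those could be added as further weighted checks.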

Meter and rhythm

A (usually) regular pulse is recognisable, along with a recognisable groove or beat. The rhythms of notes, and especially the rhythm of non-pitched parts, express and project the beat. Some voices have more responsibility for this than others; the bass voice has particular importance. There’s often a fairly clear distinction between “rhythm parts” (which nevertheless mostly follow the rules of melodic coherence) and “melody parts” (which nevertheless agree with and complement the beat).

On the scale of small groups of notes, a relatively small number of distinct rhythms is used over the course of a musical work. Each voice uses its own set of rhythms, not precluding overlap; two voices may use similar or quite different rhythms. The overall distribution of rhythms within a work exhibits both repetition and novelty.

Overall structure of a work

Musical works are composed of a handful of sections (with some variation between occurrences of a section). Chord progressions have an important role in giving sections their identity and distinguishing them from each other.

The structure of a work governs how musical material is presented to the listener, conveying the overall interplay of expectation and surprise through repetition, variation and novelty. In some genres, such as art music, the statement and development of motives over the course of a work is an important part of the structure.

Formalization and implementation

Some aspects of musical analysis and composition have been well formalized in the literature, in a way which makes them fairly simple to implement in computer programs. Rules for voice leading (which I have mentioned under the heading of melodic coherence) are an example. Using commonly taught rules of voice leading as additional constraints on random MIDI generation does seem to produce artefacts which sound more musical, which suggests that a MIDI artefact’s compliance with voice leading rules makes it more likely to be perceived as musical.
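
To give a flavour of what constraining random generation can look like, here is a deliberately simplified sketch. Only two constraints stand in for voice-leading rules: a fixed register and a maximum interval between consecutive notes. Both constraints, their default values and the function name are assumptions for illustration; commonly taught voice-leading rules are richer and also govern the relationships between voices.

```python
import random


def constrained_voice(length, low=55, high=79, max_leap=4, seed=None):
    """Random melody kept within a register and limited to small intervals.

    length: number of notes to generate (at least 1).
    low, high: assumed register bounds as MIDI note numbers.
    max_leap: assumed largest allowed interval in semitones.
    """
    rng = random.Random(seed)
    pitch = rng.randint(low, high)
    melody = [pitch]
    for _ in range(length - 1):
        # Candidate next pitches: within max_leap semitones, inside the register.
        candidates = [p for p in range(pitch - max_leap, pitch + max_leap + 1)
                      if low <= p <= high]
        pitch = rng.choice(candidates)
        melody.append(pitch)
    return melody


print(constrained_voice(16, seed=3))
```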

Other aspects of analysis and composition have not been well formalized, and stating them in formal terms poses a significant problem. My use above of terms such as “mostly” and “usually” glosses over the way in which elements must actually be structured in order to pass as “musical”. Much more research is needed in some areas.

The distribution of rhythms is one example. Artefacts consisting of a single repeated rhythm are not perceived as musical, but neither are artefacts with no repetition of rhythms. Where is the middle ground? How can we distinguish musical from non-musical artefacts based on the distribution of rhythms? Naively, we might identify the distinct rhythms of groups of (say) three notes in a particular work of music, and count the number of occurrences of each rhythm in the work. If we use these statistics to parameterize random variables, and generate new music by sampling the random variables, the resulting artefacts still do not seem to pass as musical. If the aim is to be able to classify artefacts as “non-musical” or “musical”, simple statistical properties of occurrences of rhythms do not seem to provide sufficient conditions.
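
The naive approach described above might look like the following sketch. Durations are given in quarter notes, a “rhythm” is taken to be the tuple of three consecutive durations, and the groups overlap; these choices and the function names are assumptions, since the description does not fix them.

```python
import random
from collections import Counter


def rhythm_counts(durations, n=3):
    """Count each distinct rhythm of n consecutive note durations."""
    windows = zip(*(durations[i:] for i in range(n)))  # overlapping groups
    return Counter(windows)


def sample_rhythms(counts, k, seed=None):
    """Draw k rhythms independently, weighted by their observed counts."""
    rng = random.Random(seed)
    rhythms = list(counts)
    weights = [counts[r] for r in rhythms]
    return rng.choices(rhythms, weights=weights, k=k)


# Durations in quarter notes for a short, made-up melody.
melody = [1, 1, 2, 1, 1, 2, 0.5, 0.5, 1, 2]
counts = rhythm_counts(melody)
print(counts.most_common(2))
print(sample_rhythms(counts, k=4, seed=0))
```

Note that this sketch samples each group independently, so it reproduces how often each rhythm occurs but ignores the order in which rhythms follow one another in the source.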