Simple Markup


Simple Mark Up is a "Text to Text Algorithm" – any input text to any output text. Download: smu-1.4.2.zip.
Oh, and this might be useful: php-debug.zip.

SMU is not at all related to textutils like SED and AWK.

The "From > To" text definitions are external to the main code, in an associative array of regular expressions and replacement strings.

One does not need complex, multi-line expressions to transform text. SMU uses short, simple regular expressions that are easy to understand and to write, which is why "Simple" is in the name.

By design, the main code neither knows nor cares "how" the text is translated.
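
As a rough illustration of that separation, the sketch below keeps the rules in arrays and lets one generic function apply whichever array it is given. The names apply_rules(), $to_html and $to_text are invented for this example and are not part of SMU.

```php
<?php
// Hypothetical names throughout: apply_rules(), $to_html and $to_text are
// illustrations only and are not part of SMU.

// Generic code: it loops over whatever rules it is handed and returns the
// result of the first rule that matches.
function apply_rules(array $rules, string $line): string {
    foreach ($rules as $regex => $replacement) {
        if (preg_match($regex, $line)) {
            return preg_replace($regex, $replacement, $line);
        }
    }
    return $line; // no rule matched: the line passes through unchanged
}

// Two rule tables for the same input syntax but different output formats.
$to_html = ['/^#\s*(.+)$/' => '<h1>$1</h1>'];
$to_text = ['/^#\s*(.+)$/' => '== $1 =='];

echo apply_rules($to_html, '# Title'), "\n"; // <h1>Title</h1>
echo apply_rules($to_text, '# Title'), "\n"; // == Title ==
echo apply_rules($to_html, 'Plain.'), "\n";  // Plain.
```

Changing the output format here means swapping the array, not editing the function.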

Generally, the input data will be formatted, i.e. created by humans, such as "Markdown" text. The output data too will generally be formatted for humans, but need not be.

However, with SMU, even very complex text such as (the worst of) HTML can be parsed with regular expressions.

Herein, the term "text" implies formatted text; "data" means text that may be unformatted or even binary. While either may be created/written/parsed by machine, the former is generally written by humans.

Version

This is version 1.4.2, 5th release, September 2022.

The current base algorithms total just under 600 lines of PHP code. SMU claims that Markdown markup is now implemented in just 250 lines of PHP data and code.

Usage

The main code is not called directly, but is included by a "definition file"; these are named smu_<name>.php, where <name> hints at what the file does.

A definition file – or as is sometimes called, a "dataset" or "the DATA" – defines data in the form of PHP associative arrays (of particular names) that get applied to the input text to convert to the output text.

For example, to convert Markdown to HTML:

php smu_md.php readme.md > readme.html

Command Line Interface

While the "typical" use of this code is to create a dataset (or modify an existing one) to perform new conversion rules, there is also a CLI to adjust the runtime configuration of existing datasets.

For example, to convert Markdown to HTML:

./smu readme.md smu_md.php > readme.html

The input file and data file are swapped here only because the data file is optional: there is a default data file, smu_.php, described next.

Non-Usage

By non-usage is meant using the code without a dataset. The main code is just an API and does nothing on its own; it is meant to be used by a dataset, with the "default dataset" serving as a template.

The way into the API is the function simple_markup(), which is called with a file name (or the input text to use).

Given this test text, where the character · is used for a newline:

One.·
Two.·
·
Three four.·

The default dataset is like:

include 'smu.php';
echo simple_markup($argv[1]);

which, used as ./smu test_text.txt, will output:

One.·
Two.·
·
Three four.·

Which is the same. By design.

To do more to the output, a combination of defined constants and data is used. For example, here is applying "greediness":

const MU_GREEDY = 1;
include 'smu.php';
echo simple_markup($argv[1]);

which will output:

One. Two.·
·
Three four.·

With const HTML = 1; added, it will output:

<p>One. Two.</p>·
·
<p>Three four.</p>·

The CLI is more comprehensive with options to apply those changes, -greedy=1 and -html=1, and much more.
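
To make the three outputs above concrete, here is a minimal sketch of what "greediness" and the HTML wrappers amount to. It is not SMU's implementation: sketch_paragraphs() and its two boolean parameters are hypothetical stand-ins for the constants shown above.

```php
<?php
// Hypothetical helper, not SMU's code: with $greedy set, consecutive
// non-blank lines are joined into one paragraph; with $html set, each
// paragraph is wrapped in <p>...</p>.
function sketch_paragraphs(string $text, bool $greedy, bool $html): string {
    $lines = explode("\n", rtrim($text, "\n"));
    $out   = [];   // finished output lines
    $para  = [];   // lines of the paragraph being collected

    // Emit the collected paragraph (if any) as one output line.
    $flush = function () use (&$out, &$para, $html) {
        if ($para) {
            $p     = implode(' ', $para);
            $out[] = $html ? "<p>$p</p>" : $p;
            $para  = [];
        }
    };

    foreach ($lines as $line) {
        if ($line === '') {      // a blank line ends the paragraph
            $flush();
            $out[] = '';
        } else {
            $para[] = $line;
            if (!$greedy) {      // not greedy: every line stands alone
                $flush();
            }
        }
    }
    $flush();

    return implode("\n", $out) . "\n";
}

$text = "One.\nTwo.\n\nThree four.\n";
echo sketch_paragraphs($text, false, false); // unchanged
echo sketch_paragraphs($text, true, false);  // "One. Two." on one line
echo sketch_paragraphs($text, true, true);   // paragraphs wrapped in <p>
```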

DATA

The purpose of SMU is to apply DATA to convert input TEXT to output TEXT, with the DATA as simple arrays.

There are two types of arrays, one for "blocks" and one for "inlines", with regular expressions as keys and their replacement text as values.

Lines

An example from Markdown is to replace all # Header lines with <h1>Header</h1>, with a regular expression like:

/^#\s*([^#]*)(#*)$/

with a replacement string like:

<h1>$1</h1>

The difference is that SMU does not apply dozens of complex, multi-line expressions to the entire input text, but simple expressions in an array applied to each line of input, like:

$mu_lines = [
    '/^#\s*([^#]+)$/' => '<h1>$1</h1>',
];

If that seems more complicated, one might be surprised.

Inlines

Inlines, or emphasis in Markdown terminology, are similar, with emphasis like:

$mu_inlines = [
    '/\*(.+)\*/U' => '<em>$1</em>',
];

The actual expression used handles word boundaries, escapes, etc.; but that is the basic expression.
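
The U modifier in that basic expression matters: it makes PCRE quantifiers non-greedy by default, so each pair of asterisks is matched on its own. A small demonstration, using only the basic expression shown above and a made-up input string:

```php
<?php
// Demonstration only: the input string is made up for this example.
$text = 'a *b* and *c* here';

// With /U, (.+) stops at the nearest closing asterisk.
$ungreedy = preg_replace('/\*(.+)\*/U', '<em>$1</em>', $text);
echo $ungreedy, "\n"; // a <em>b</em> and <em>c</em> here

// Without /U, (.+) runs to the last asterisk on the line.
$greedy = preg_replace('/\*(.+)\*/', '<em>$1</em>', $text);
echo $greedy, "\n";   // a <em>b* and *c</em> here
```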

Data Code

The value for any regular expression can be a function, which is how more complex output can occur, like multi-line blocks such as ordered and unordered lists. This is called "data code".

A really simple example is a single expression for all Markdown headers, with a closure as its data code:

$mu_lines = [
    '/^(#+)\s*([^#]+)/' => function($m) {
        /* header level = number of leading # characters */
        $n = strlen($m[1]);
        return "<h$n>{$m[2]}</h$n>";
    },
];

This shows that assigned functions are passed the $matches argument of preg_match(). Alternatively, the data can just be a function name, such as md_header, for this function to be used:

function md_header($m) {
    $h = trim($m[2]);
    $n = strlen($m[1]);
    return "<h$n>$h</h$n>";
}

This way all datasets can be easily expanded to do anything in two ways, through the array data or its functions, like:

function md_header($m) {
    $h = trim($m[2]);
    $n = strlen($m[1]);
    $i = str_replace(' ','_',strtolower($h));
    return "<h$n id=\"$i\">$h</h$n>";
}
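
A quick way to try such a function outside SMU is to feed it the matches from preg_match() directly. The snippet below repeats the md_header() variant with the id attribute so that it runs on its own; the sample header line is made up.

```php
<?php
// md_header() as given above, repeated so this snippet is self-contained.
function md_header($m) {
    $h = trim($m[2]);
    $n = strlen($m[1]);
    $i = str_replace(' ', '_', strtolower($h));
    return "<h$n id=\"$i\">$h</h$n>";
}

// Call it by hand with the matches of the header expression.
// The input line is an arbitrary example.
preg_match('/^(#+)\s*([^#]+)/', '## Getting Started', $m);
$html = md_header($m);
echo $html, "\n"; // <h2 id="getting_started">Getting Started</h2>
```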

Conclusion

SMU implements Markdown to HTML in under 800 lines of code and data.

The SMU Motto: "You can change the code without changing the code."


The SMU Algorithm

This is a slightly "trimmed" version of the SMU Algorithm. It is made up of two other (sub) algorithms that do the "lines" and "blocks" markup and then the "inlines" markup.

<?php #
/* simple_markup - the simple markup function - main entry point */

function simple_markup($data) {
    $s = '';        /* return string */
    $p = MU_P;      /* paragraph wrappers "paragraph" */
    $b = MU_B;      /* "break" */
    $n = MU_EOL;    /* newline */

    smu_open($data);

    while (($_ = smu_read($data)) !== NULL) {

        $l = markup_lines($_,$data);

        if ($l === FALSE) {
            markup_getline($_,$data);
            $l = markup_inlines($_);
            if ($l) {
                $l = "$p$l$b";
            }
        }

        $s .= "$l$n";
    }

    return $s;
}

The initialization of the local variables with the defined constants makes the code easier to read and work with.

For the "opening" of the input text, $data is overloaded and passed by reference. Overloaded in that it was a file name (string) and then becomes the file handle.

Most people will decry such overloading and modification. No need to justify that usage now though. (If a reader is grimacing no one is forcing them to continue.)

The algorithm starts with the "while (there's another line)" loop based on smu_read(); each line is read into $_, which is NULL when there are no more lines.

The single-letter variable names (and $_) are also a convention; $l is overloaded, and $_ is passed by reference.

Same caveats as previous note.
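
Since smu_open() and smu_read() are not shown here, the following is only a guess at how such an overloaded, by-reference $data could work. sketch_open() and sketch_read() are hypothetical stand-ins, and treating a non-file string as literal text via a memory stream is an assumption made for this example.

```php
<?php
// Hypothetical stand-ins for smu_open()/smu_read(); the real functions are
// not shown in this document and may work differently.

// $data arrives as a string (file name or literal text) and leaves as a
// stream handle: the "overloading" described above.
function sketch_open(&$data): void {
    if (is_file($data)) {
        $data = fopen($data, 'r');
    } else {
        // Assumption: anything that is not a file is the input text itself.
        $h = fopen('php://memory', 'w+');
        fwrite($h, $data);
        rewind($h);
        $data = $h;
    }
}

// Returns the next line without its line ending, or NULL at end of input.
function sketch_read(&$data): ?string {
    $line = fgets($data);
    return $line === false ? null : rtrim($line, "\r\n");
}

$data = "One.\nTwo.\n";
sketch_open($data);                 // $data is now a stream, not a string
$seen = [];
while (($_ = sketch_read($data)) !== null) {
    $seen[] = $_;
    echo "[$_]\n";
}
fclose($data);
```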

The markup_lines() function, itself another algorithm, will either mark up a line (or multiple lines) or not. If it did mark up some text, the result ($l) will be part of the formatted output; if not, its return is FALSE (and the unmodified line is still in $_).

The markup_inlines() function, itself another algorithm, does the inlines markup (or not). But before that, the markup_getline() function is used to possibly read a paragraph (greediness as defined by Markdown).

"Inlines" can sometimes be called "emphasis", as in what Markdown does, though SMU does nothing on its own; the DATA defines what "inlines" are.

The result of inlines markup is the line(s), marked up or not, in $l. Then the line (paragraph) is enclosed within the "paragraph wrappers", $p and $b: for HTML these are the typical <p> and </p> tags; for TEXT $p and $b are (usually) empty.

The formatted output text keeps growing by appending, and is returned when the loop terminates.

That is all, and it is why "Simple" is used in the name. The functions that do the markup are next.

Lines Markup

This is what SMU does: it applies a list of regular expression replacements to each line of input text—with the ability to group lines together.

Here is the lines algorithm and commentary.

function markup_lines(&$line, &$data) {
    global $mu_lines;

    foreach ($mu_lines as $regex => $repl) {
        $n = preg_match($regex,$line,$m);
        if (!$n) {
            continue;
        }
        if (is_object($repl) || function_exists($repl)) {
            $res = $repl($line,$regex,$data,$m);
            return $res;
        }
        $res = preg_replace($regex,$repl,$line);
        return $res;
    }

    return FALSE;
}

The lines array is a list of regular expression keys with a value of what to do with, or how to apply, the regex to the line. If the value is a function that function is applied to the line. Otherwise the value is a replacement string applied to the line.

Two very important points with this are: 1) $data being passed by reference is how a (defined data) function can read multiple lines for block markup, such as a Markdown quote block; 2) if a regex matches a line the result is returned and lines markup stops for that line.

The function returns marked-up text, or FALSE to indicate that no markup occurred.

While some people will also decry multiple return points in a function, there is a reason for doing so here. Again, this code is not being forced on anyone.
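
Point 1 above, a data function pulling in further lines through $data, might look roughly like the sketch below. It takes the same four arguments that markup_lines() passes, but sketch_quote() is hypothetical, and $data is simplified to a plain array of unread lines rather than SMU's input handle.

```php
<?php
// Hypothetical sketch, not SMU's quote-block code. Assumption: $data is a
// plain array of the lines not yet read, standing in for the real input.
function sketch_quote($line, $regex, &$data, $m) {
    $quoted = [$m[1]];
    // Keep consuming following lines while they match the same expression.
    while ($data && preg_match($regex, $data[0], $next)) {
        $quoted[] = $next[1];
        array_shift($data);      // the line is now "read"
    }
    return '<blockquote>' . implode(' ', $quoted) . '</blockquote>';
}

$rule = '/^>\s?(.*)$/';
$data = ['> second line', 'not quoted'];   // lines still unread
$line = '> first line';

preg_match($rule, $line, $m);
$block = sketch_quote($line, $rule, $data, $m);
echo $block, "\n"; // <blockquote>first line second line</blockquote>
// $data now holds only the line that was not part of the block.
```

Because $data is taken by reference, the caller's loop resumes after the block instead of seeing the consumed lines again.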

Inlines Markup

The inlines algorithm is similar in architecture but differs in process.

function markup_inlines($line) {
    global $mu_inlines;

    foreach ($mu_inlines as $regex => $repl) {
        $n = preg_match($regex,$line);
        if (!$n) {
            continue;
        }
        if (is_object($repl) || function_exists($repl)) {
            $line = preg_replace_callback($regex,$repl,$line);
        }
        else {
            $line = preg_replace($regex,$repl,$line);
        }
    }

    return $line;
}

That applies a list of regular expressions to a line; every expression that matches is applied, not just the first. Again, the value for a regex is a function or a replacement string.

The line is always returned, either marked up or not.
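
A small, made-up rule table shows both behaviours: all matching rules are applied in array order, and a value may be a replacement string or a closure. The rules and sketch_inlines() below are illustrations, not SMU's Markdown data, and the closure test is simplified to instanceof Closure.

```php
<?php
// Hypothetical rule table and helper (not SMU's data or code). Order
// matters here: the ** rule has to run before the * rule.
$inlines = [
    '/\*\*(.+)\*\*/U' => '<strong>$1</strong>',
    '/\*(.+)\*/U'     => '<em>$1</em>',
    '/`(.+)`/U'       => function ($m) {
        // Data code: escape HTML special characters inside code spans.
        return '<code>' . htmlspecialchars($m[1]) . '</code>';
    },
];

// Apply every rule in turn; closures go through preg_replace_callback().
function sketch_inlines(array $rules, string $line): string {
    foreach ($rules as $regex => $repl) {
        $line = ($repl instanceof Closure)
            ? preg_replace_callback($regex, $repl, $line)
            : preg_replace($regex, $repl, $line);
    }
    return $line;
}

$out = sketch_inlines($inlines, 'A **bold** and *soft* `a<b`');
echo $out, "\n";
// A <strong>bold</strong> and <em>soft</em> <code>a&lt;b</code>
```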

Conclusion

An astute reader may have seen the flaws of this code: one of operation, one of performance.

The first is exposed by a simple question: "How does line/block markup apply inlines markup?" In this version (1.4.2, Autumn, 2022) the functions that mark up blocks call the markup_inlines() function themselves, if they want. While that does add some complexity to the overall process, its justification is that it also adds flexibility.

Just how block inlines markup occurs is documented elsewhere.

Then, a reader may be thinking, "All inlines regular expression definitions are applied to each line?" Yup. "Why not apply them in turn to the entire input at once?" Nope. For the SMU Markdown data there are about 23 regular expressions, averaging about 32 characters each. This code ain't gonna be slow.

This code has, and needs, no complex, multi-line regular expressions with perhaps dozens of assertions. That is one of the reasons for this code.