Wikis as Multigraphs of Text

Where I explore designing a type definition of a Wiki, without code this time

I have a lot of wacky computer science ideas, more of which will appear as other blog posts in the future. This is the most recent one, and therefore the least pickled.

What is a Multigraph, anyway?

A multigraph is a concept from discrete mathematics, and it is a more general form of a graph. So, what is a graph?

I think that the best way to explain these kinds of graphs is to use a different term: network. Most people have heard of a network. A social network includes you and all of your friends and family, and all of their friends and family, and so on and so on. I’m going to borrow a picture from Wikipedia.

Imagine that each circle with a number is a person, and each line between them is a friendship. That means that 1 is friends with 2 and 5, while 6 is only friends with 4.
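That friendship network can be written down directly as an adjacency list. Here is a minimal sketch in Python; the edges for 1, 2, and 6 follow the description above, and the remaining edges are assumed from the standard Wikipedia example graph:

```python
# The friendship graph: each vertex maps to the set of vertices it
# shares an edge with (friendship is mutual, so the map is symmetric).
friends = {
    1: {2, 5},
    2: {1, 3, 5},
    3: {2, 4},
    4: {3, 5, 6},
    5: {1, 2, 4},
    6: {4},
}

# 1 is friends with 2 and 5, while 6 is only friends with 4.
print(friends[1])  # {2, 5}
print(friends[6])  # {4}
```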

However, since graphs are more abstract than groups of friends, we can use them to represent many different things. And there are different kinds of graphs, too! In a multigraph, two vertices can be connected by more than one edge, and those edges (the lines between the circles) can be of different kinds. In the picture below, they are different colors.

A multigraph with three kinds of edges: grey, red, and blue
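One minimal way to model a multigraph in code is to keep the edges as a list of (vertex, vertex, kind) triples, so the same pair of vertices can appear more than once with different kinds. The vertex names here are invented for illustration; the kinds match the colors in the picture:

```python
# A tiny multigraph: parallel edges between the same two vertices are
# allowed, distinguished by their kind (like the edge colors above).
edges = [
    ("a", "b", "grey"),
    ("a", "b", "red"),   # a second, different edge between a and b
    ("b", "c", "blue"),
]

# Filtering by kind recovers an ordinary graph.
red_edges = [(u, v) for u, v, kind in edges if kind == "red"]
print(red_edges)  # [('a', 'b')]
```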

What I am theorizing is that a multigraph can represent wikis themselves.

How are Wikis Multigraphs?

Graph of Hyperlinks

Wikis have hyperlinks between documents. Here is a hyperlink to the Wikipedia page on hyperlinks. We can represent this kind of link between one bit of text and another in a graph. Let’s take a look at that first picture again:

Now imagine that each of those circles represents a Wikipedia page, and that each line between them represents a hyperlink.

The Outline is a Directed Acyclic Graph

A Directed Acyclic Graph (DAG) is a type of graph where all the edges have a sense of direction (directed) and there are no loops (acyclic). A good, real-life example of this is a family tree (from Wikipedia again).

This DAG can’t have loops in it because people cannot be a parent and a child to themselves, nor can you be your own grandpa (biologically anyway). The directed nature of this graph can go either way: you can say that the arrow points to parents, or to children. Typically, we want our DAGs to point to one particular thing, so we will say that the arrows point to parents.

Why am I bringing this up? Because I realized that the outline structure of a document is also a DAG. Each Heading has several blocks of text under it, like the children in the family tree. And each Heading may itself be a subheading of a bigger Heading. The concept of the “document” itself can be seen as the “Lucas Grey” of our family tree: the root ancestor at the top. Each document in a wiki is a DAG.
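The outline-as-DAG idea can be sketched by giving every block a pointer up to its parent heading, with the document itself as the root; the block names here are invented for the example:

```python
# Each block points "up" to its parent, the way the family-tree arrows
# point to ancestors; the document is the root, with no parent.
parent = {
    "Document": None,
    "Heading 1": "Document",
    "Heading 1.1": "Heading 1",
    "paragraph A": "Heading 1.1",
    "Heading 2": "Document",
    "paragraph B": "Heading 2",
}

def ancestors(block):
    """Walk the Outline edges up to the root; no loops, so this terminates."""
    chain = []
    while parent[block] is not None:
        block = parent[block]
        chain.append(block)
    return chain

print(ancestors("paragraph A"))  # ['Heading 1.1', 'Heading 1', 'Document']
```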

Therefore Multigraph?

So, you have a graph of text blocks that form DAGs of documents, but each of those text blocks can also connect to text blocks in other DAGs with hyperlinks. What if we had a graph with two kinds of edges, the Outline kind and the Hyperlink kind? An entire wiki could, in theory, be stored in a single multigraph of text blocks, with Outline DAGs encoding documents and Hyperlink edges representing the connections between all the documents.
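Putting the two together, a whole wiki could live in one edge-labeled structure. This sketch keeps one vertex set and two edge sets; all the block names are illustrative, not a settled design:

```python
# One vertex set of text blocks, two kinds of edges between them.
blocks = {"Cats", "Cats/intro", "Dogs", "Dogs/intro"}

outline = [            # DAG edges: child block -> parent heading
    ("Cats/intro", "Cats"),
    ("Dogs/intro", "Dogs"),
]
hyperlink = [          # free-form links between any two blocks
    ("Cats/intro", "Dogs"),
]

# The same vertex can participate in both kinds of edge:
print([(u, v) for u, v in outline if u == "Cats/intro"])    # [('Cats/intro', 'Cats')]
print([(u, v) for u, v in hyperlink if u == "Cats/intro"])  # [('Cats/intro', 'Dogs')]
```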

Conclusion

I believe that the type definition of a Wiki would be a multigraph of text. I think that you could store an entire wiki inside of a graph database and have a really interesting architecture for a wiki engine. But that’s for next time 🙂

Source Code in a Wiki?

This is an add on to my previous post, Wikis as Multigraphs of Text.

As of right now, most source code is stored as text files in a file system, and this has worked out fine for the industry for decades. But there are other ways of storing programs. What if we stored source code in a wiki?

As a reminder from my previous blog post: a Wiki data type could be encoded as a multigraph, that is, a graph with at least two kinds of edges: Outline, which encodes text-block order for output, and Hyperlink, which tracks hyperlink references between blocks.

Why should we put source code in wikis?

For the purposes of this blog post, I’d like to avoid the question of the usefulness of this paradigm, and just make this statement: I think this is a fun thought experiment, not a call for change.

If/when I do implement some kind of system based on these ideas, I will report that in another post.

I think that a wiki could be an excellent way for a group of programmers to maintain a codebase over time. Utility scripts could be encoded with documentation alongside, with comments and examples included. If the wiki were version controlled, then all of that information would be as well, all alongside your code, instead of in a different place. And if that wiki were a multigraph, then you could use graph theory to structure your code instead of a bytestream.

Here’s a made-up example: say there is a license/copyright comment block at the top of every file in your codebase. In real-world examples of this, there is often a script or automation of some kind that makes sure all of those comment blocks stay up to date. If source code were encoded in a graph, you could have a single vertex that holds the license text, and then any number of references to it. If the system presents the code as a file system (as a particular view into your graph data), then every file would have that same comment block at the top.
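The license example can be sketched like this: one vertex holds the license text, every file-level vertex points at it, and updating the license means updating a single vertex. All the vertex names and the rendering function are invented for illustration:

```python
# One shared vertex for the license text...
vertices = {
    "license": "Copyright (c) 2016, licensed under MIT.",
    "main": "print('hello')",
    "utils": "def helper(): pass",
}

# ...and a reference from every file-level vertex to it.
includes_license = ["main", "utils"]

def render_file(name):
    """Present a file-system view: shared comment block, then the code."""
    header = vertices["license"] if name in includes_license else ""
    return header + "\n" + vertices[name]

# Change the license once; every rendered file picks it up.
vertices["license"] = "Copyright (c) 2017, licensed under MIT."
print(render_file("main").splitlines()[0])  # Copyright (c) 2017, licensed under MIT.
```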

How does this relate to Literate Programming?

This is an evolution of Literate Programming, and as such, the tangle and weave functions from that paradigm are required. The weave function creates a website that lets users view the graph data as prose. The tangle function creates a filesystem view of the graph data, with the source code in plain text.

What types do we need for the multigraph?

So, the types needed to encode this change would have to expand to include different kinds of text blocks, plus a new type of edge.

In the last post, the Wiki had only one type of block, which was just Text. To differentiate between kinds of blocks, however, we need at least two: Prose and Code. If we wanted our wiki to be a polyglot environment, the Code block could record which language the code is in, either as Text or possibly as a sum type.

The Outline edge type can also be divided into two kinds, which I’m naming after the two Literate Programming functions: Tangle and Weave. Both the Tangle and Weave types could include a filepath as a property. During the tangle operation, the system would output the files as they should be for compilation, and during the weave operation, it would generate the wiki pages for user viewing.
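Here is a rough sketch of the expanded types, with a toy tangle operation. The block shapes and field names are my own guesses at a design, not a settled one:

```python
from dataclasses import dataclass

@dataclass
class Prose:
    text: str

@dataclass
class Code:
    language: str   # could be a sum type in a richer type system
    text: str

@dataclass
class Tangle:        # an Outline-style edge carrying a filepath property
    child: int
    parent: int
    filepath: str

blocks = {0: Prose("The entry point."), 1: Code("python", "print('hi')")}
edges = [Tangle(child=1, parent=0, filepath="main.py")]

def tangle(blocks, edges):
    """Group Code blocks by filepath, ready to write out for compilation."""
    files = {}
    for e in edges:
        block = blocks[e.child]
        if isinstance(block, Code):
            files.setdefault(e.filepath, []).append(block.text)
    return {path: "\n".join(lines) for path, lines in files.items()}

print(tangle(blocks, edges))  # {'main.py': "print('hi')"}
```

A weave operation would walk the same edges but emit wiki pages, keeping the Prose blocks that tangle discards.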

The End

This is a fairly new idea for me, and I don’t know a lot of details as of yet. I’m going to read up on TinkerPop and Goblin and see if I can’t test out some of these ideas on my own.

Thanks for reading!

Multiple-Choice Language Features for Programming Languages

Where I describe an a-la-carte language-feature compiler toolchain

Semantics in general

When I was first learning programming, I thought that the syntax of a language (keywords, spacing, how to write functions, &c) was the only thing that I needed to learn. Then I started learning about something called programming language semantics. Semantics in general refers to the meaning behind words and symbols, both in linguistics and in programming language theory.

For an extended example: I took CS50, the Introduction to Computer Science course that Harvard offers via edx.org. In the 2014 version of the course, they taught C as the main language, so I learned about memory management with malloc and friends. Then I taught myself Python after the course and learned about the “garbage collector,” and that was my first lesson in semantics. Even though the languages don’t make a specific syntactic declaration about how memory is managed, the semantics of the two runtimes are quite different.

Semantic Differences Matter

One thing I’ve noticed in the many discussions I’ve had about programming languages at meetups and other places: although syntax is something that many programmers care deeply about, what really sets languages apart is their semantics. Specifically, the language features built behind the scenes, like garbage collection, or a module system, and so on. This is also a source of language envy and innovation. Object orientation is a language feature that adds specific semantics to a computer language; ditto for functional semantics.

In theory, you could have the same syntax but many, many different semantic components, and you’d describe a whole slew of languages and their associated run-times.

What if we made that a feature? What if there were a language built to allow multiple semantic components to be chosen for each program written? It would be similar to Racket’s #lang syntax, but perhaps in a build description rather than in each file separately.

Semantics a-la-carte?

Imagine that there was a programming language system where all major language features were multiple choice.

Choice 1: Type Checker

What if we could choose which type checker we wanted to use within the same toolchain? Turn it off completely for simple scripts, turn it all the way up for complex applications.

  1. Unitype Checker: only report runtime errors, be completely dynamic
  2. Gradual Type Checker with type inference (with compilation errors?)
  3. Full Powered Type System with all the compilation errors

Choice 2: Memory Allocation

I’ve heard more than one programmer say that they miss C because of the simplicity of the system. I don’t know if that includes memory management, but I do know that in some instances, garbage collection is not appropriate. So, what if the garbage collector were optional?

  1. Manual Memory Management, like C
  2. Garbage Collection, like Python and Haskell

In fact, there could be multiple garbage collectors built into a toolchain, each with different properties. So you could declare, at the start of a project, what kind of memory semantics you want, without changing the syntax, and with the type system also being a-la-carte.

Choice 3: Concurrency System

One thing about Python that hurts performance but makes the semantics easier for a beginner to understand is the GIL, which prevents more than one thread from executing Python bytecode at a time. Well, in CPython at least. But what if Python were made in this a-la-carte fashion, and projects could decide whether they want the GIL or not? Or go even further, to a system like Erlang’s?

  1. Single-threaded only
  2. Simple Message Passing System
  3. Something like OTP/BEAM in Erlang

How many “languages” is this?

What I’ve described here could be counted as 18 different languages (3 type checkers × 2 memory systems × 3 concurrency systems). Or, if built right, it could be one language with 18 different operating modes.

This system would require something like 8 different components in the compilation toolchain (3 + 2 + 3), in addition to the usual ones, like the lexer/parser, assembler/linker, and so on. So there would be a lot of work. But the upside is that more people could use the language to fit their own needs. Each component could be developed as a deep research topic, with users deciding whether or not they want to enable it.
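The menu of choices above could be captured in a build description. This sketch, with option names invented for illustration, also counts how many distinct "languages" the menu describes:

```python
from itertools import product

# The three a-la-carte choices described above, as a hypothetical build config.
FEATURES = {
    "type_checker": ["unitype", "gradual", "full"],
    "memory": ["manual", "gc"],
    "concurrency": ["single_thread", "message_passing", "otp_style"],
}

# One project's selection, as it might appear in a build file:
project = {"type_checker": "gradual", "memory": "gc", "concurrency": "message_passing"}
assert all(project[k] in FEATURES[k] for k in FEATURES)  # validate the selection

# Every combination is its own "operating mode" of the one language.
modes = list(product(*FEATURES.values()))
print(len(modes))  # 18
```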

For a concrete example, look at GHC, the de facto compiler for Haskell. It has so many different language extensions, and they often come in groups. What if, instead of having to specify a collection of extensions for type-level programming, there were a single “Type-Level Programming” option? Then, when you wanted to level up your Haskell, you could turn that feature on, and that component could have its own focused documentation on how to use it properly.

Possible Problem: too much work

Would the additional work be worth the benefit? I have no idea.

Is this too weird of an idea?

Is there any merit to this proposed approach? I’ll openly admit it’s a thought experiment.

Comment below to let me know!