Saturday, January 5, 2013

Is there software for parsing text into bare grammar?

In all honesty, the previous post, which is a story beginning, is probably more important.

Ok, so I've sort of asked this question before, but it was surrounded by other questions and kind of got overlooked, I think.

Grammar is rather complicated, and thus can be difficult for something like a computer to parse, because a computer hasn't grown up in a human fashion, with the oddities that come naturally to those of us raised in an environment of colloquialisms and things that made sense even if they weren't quite right.

On the other hand, people have been trying to get computers to be useful with respect to grammar for around as long as I've been alive, so who knows what's out there now.

The most basic thing I'd like to be able to do is take a block of text and strip all the meaning from it leaving only the grammar:

Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the galaxy, lies a small, unregarded yellow sun. Orbiting this at a distance of roughly ninety-million miles is an utterly insignificant blue-green planet whose ape-descended lifeforms are so amazingly primitive that they still think digital watches are a pretty neat idea. This planet has, or had, a problem which was this: Most of the people living on it were unhappy for pretty much of the time.

[adverb] [preposition] [preposition] [article] [adjective] [noun] [preposition] [article] [adjective] [noun] [preposition] [article] [adjective] [adjective] [noun] [preposition] [article] [noun], [verb] [article] [adjective], [adjective] [adjective] [noun]. [adjective] [pronoun] [preposition] [article] [noun] [preposition] [adverb] [adjective]-[adjective] [noun] [verb] [article] [adverb] [adjective] [adjective]-[adjective] [noun] [pronoun] [noun]-[adjective] [noun] [verb] [adverb] [adverb] [adjective] [conjunction] [pronoun] [adverb] [verb] [adjective] [noun] [verb] [article] [adverb] [adjective] [noun]. [pronoun] [noun] [verb], [conjunction] [verb], [article] [noun] [pronoun] [verb] [pronoun]: [noun] [preposition] [article] [noun] [adjective] [preposition] [pronoun] [verb] [adjective] [preposition] [adverb] [noun] [preposition] [article] [noun].

Or something like that.  (I did the above while my dog was being particularly ill-mannered, so it could be quite wrong.)

Except it would probably need more categories. (Wiktionary lists 46 pages in its "part of speech" category, and it doesn't get into how many objects a transitive verb has.)

But at its most basic I'm looking for something that strips meaning and outputs grammar.  Does such a thing exist?
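As a rough illustration of what "strip meaning, output grammar" could look like, here is a minimal sketch. It assumes a tiny hand-made word list (TOY_LEXICON, a hypothetical name, with tags chosen to match my attempt above) rather than a real tagger, so it only shows the shape of the output. A real tool would have to decide tags from context, since plenty of English words fit more than one part of speech.

```python
import re

# Hypothetical toy lexicon, made up for this illustration only.
# A real tagger would pick tags from context instead of a fixed table.
TOY_LEXICON = {
    "this": "pronoun",
    "planet": "noun",
    "has": "verb",
    "or": "conjunction",
    "had": "verb",
    "a": "article",
    "problem": "noun",
}

# A token is either a run of letters/apostrophes or one punctuation mark.
TOKEN = re.compile(r"[A-Za-z']+|[^A-Za-z'\s]")


def strip_meaning(text, lexicon=TOY_LEXICON):
    """Replace each word with its [tag]; keep punctuation as it is."""
    out = []
    for token in TOKEN.findall(text):
        if token[0].isalpha() or token[0] == "'":
            # Words missing from the toy lexicon are marked, not guessed.
            tag = lexicon.get(token.lower(), "unknown")
            out.append("[" + tag + "]")
        else:
            out.append(token)
    # Join with spaces, then pull punctuation back against the previous tag.
    return re.sub(r"\s+([,.:;!?])", r"\1", " ".join(out))


print(strip_meaning("This planet has, or had, a problem"))
```

Anything not in the word list comes out as [unknown], which is at least an honest way of showing where the hard part is.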

Seriously, if you have an answer, stop reading now and tell me before you get sidetracked; then you can come back and read the rest of the post.


The second most basic thing would be to keep distinct things distinct.  So rather than all prepositions being turned into [preposition], every prepositional "out" becomes [Preposition 1], every prepositional "in" becomes [Preposition 2], every prepositional "of" becomes [Preposition 3], and so on.

If you coupled this with a list of what [(Part of speech) N] was in the original text then you'd have all of the information necessary to reconstruct the original text, so unlike the first thing, you've lost no information in the process, but you have gained, for example, a way to compare sentence structures to see if any stand out as particularly common or particularly rare.  And do all sorts of silly things.  In all honesty I'm probably more interested in the silly things.
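The numbering idea can be sketched in a few lines, again assuming a hypothetical hand-made lexicon (TOY_LEXICON) in place of a real tagger. The names encode and decode are made up for this example. The point is only that the numbered tags plus the list of what each tag stood for are enough to get the words back.

```python
import re

# Hypothetical toy lexicon, made up for this illustration only.
TOY_LEXICON = {
    "out": "Preposition",
    "in": "Preposition",
    "of": "Preposition",
    "the": "Article",
    "end": "Noun",
    "galaxy": "Noun",
}

TOKEN = re.compile(r"[A-Za-z']+|[^A-Za-z'\s]")


def encode(text, lexicon=TOY_LEXICON):
    """Return (encoded tokens, legend mapping each numbered tag to its word)."""
    numbers = {}  # (tag, word) -> N, so the same word always gets the same N
    counts = {}   # tag -> highest N handed out so far
    legend = {}   # "[Tag N]" -> word
    encoded = []
    for token in TOKEN.findall(text):
        word = token.lower()
        if word not in lexicon:
            encoded.append(token)  # unknown words and punctuation pass through
            continue
        tag = lexicon[word]
        key = (tag, word)
        if key not in numbers:
            counts[tag] = counts.get(tag, 0) + 1
            numbers[key] = counts[tag]
        label = "[%s %d]" % (tag, numbers[key])
        legend[label] = word
        encoded.append(label)
    return encoded, legend


def decode(encoded, legend):
    """Swap each numbered tag back for the word it replaced."""
    return [legend.get(item, item) for item in encoded]


encoded, legend = encode("out in the end of the galaxy")
print(" ".join(encoded))
print(" ".join(decode(encoded, legend)))
```

One caveat with this sketch: it lowercases words before numbering them, so the reconstruction loses capitalization. Keeping the original spelling in the legend would fix that at the cost of treating "Out" and "out" as different entries.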

Does such a thing exist?


What I would really like would be something that could also show grammatical relationships, and that would (I think) be a lot more difficult than either of the above. The upside is that I'm thinking of prose, not poetry, so there shouldn't be too much of people exploiting the ambiguity of the language for the sake of high art.  Things should in theory be more straightforward.

But that doesn't mean they're not complicated because it means the computer:
1) Figuring out which adjectives modify which nouns (and being aware of any "not"s or "no"s or other negating terms)
2) Figuring out which verbs each noun is a subject of.  (Or agent of when the verb is passive)
3) Figuring out which verbs each noun is an object of. (Or subject of when the verb is passive.)
4) Figuring out what noun a given pronoun goes with. (Which will be very useful for all of the above and below.)
5) Figuring out which adverbs each noun is associated with.
6) Doing all of the above (except the pronoun bit) with parts of speech other than nouns.  (E.g. I'm looking at [Verb 236]: what are the nouns that are subjects of it?  What adverbs modify it?  What are its objects?  What are the adjectives of its subjects/objects?)

I assume such a thing doesn't exist, but I'm open to being pleasantly surprised.

To put this part in terms more suited to this blog, it might be interesting to compare the verbs of Edward Cullen to the verbs of Rayford Steele.  But to do that you need to:
1) Be able to identify words as verbs (probably include participles in that)
2) Be able to identify which noun a given pronoun goes with (don't want stuff getting left out because the author said "he" rather than "Ray"/"Ed")
3) Be able to identify which verbs have as their subject a given noun or its associated pronoun.

And then combine duplicate terms, which I assume you'd do manually, because I doubt you'll get a computer to be able to realize that Ray, Rayford, Rayford Steele, Mr. Steele, Captain Steele, and sometimes things like "Dad", "Sir", and "Captain" all refer to the same person.

And then at that point you'd just list the verbs by frequency for each character and see if you've learned anything or if the whole exercise was a waste of time.
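The last two steps (merging names by hand, then counting) are the easy part, and could be sketched like this. Everything here is an assumption for illustration: the ALIASES table is the manual merge list, verbs_by_character is a made-up helper name, and SAMPLE_PAIRS is invented placeholder data, not taken from either book. Getting real (subject, verb) pairs out of the text is the hard part that this sketch skips entirely.

```python
from collections import Counter, defaultdict

# Hand-maintained merge list (an assumption: built manually, as described above).
ALIASES = {
    "ray": "Rayford Steele",
    "rayford": "Rayford Steele",
    "rayford steele": "Rayford Steele",
    "captain steele": "Rayford Steele",
    "ed": "Edward Cullen",
    "edward": "Edward Cullen",
    "edward cullen": "Edward Cullen",
}


def verbs_by_character(pairs, aliases=ALIASES):
    """Count verbs per character from (subject, verb) pairs.

    Subjects that are not in the alias table are ignored.
    """
    table = defaultdict(Counter)
    for subject, verb in pairs:
        name = aliases.get(subject.lower())
        if name is not None:
            table[name][verb.lower()] += 1
    return table


# Invented placeholder pairs, NOT real data from any book.
SAMPLE_PAIRS = [
    ("Ray", "sighed"),
    ("Rayford", "sighed"),
    ("Captain Steele", "flew"),
    ("Edward", "glared"),
    ("Ed", "glared"),
    ("Edward", "smiled"),
    ("Bella", "tripped"),
]

for name, counts in verbs_by_character(SAMPLE_PAIRS).items():
    print(name, counts.most_common())
```

Context-dependent names like "Dad" or "Captain" can't go in a flat table like this, since who they refer to depends on who is speaking, so those would still need sorting out by hand.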


  1. I'm not aware of any tools that do what you want for human language (although they probably exist), but this is essentially what compilers do for programming languages: they produce an abstract syntax tree that can be manipulated in useful ways. (In that case, the result is an optimized intermediate form that is then transformed into executable machine code.) In this way, a compiler front-end can be used with multiple back-ends to produce code for different architectures.

    "Formal language" is the study of this sort of thing:

    (I'm not an expert on this; I just took "Formal Languages and Automata" and "Compiler Design" for a computer science degree...)

  2. There does appear to be a "Parts of Speech" tagger which might do the most basic of the functions you were looking at. But it's not perfect and you have to spend time "training" it.

    For the more complicated ones, well, human languages are just too ambiguous. Decades of research have gone into trying to have computers parse natural language, and we're still crap at doing anything but very restricted versions.

    To be honest, the syntax approach to understanding human languages has mostly been abandoned in the software industry. The successful natural language applications such as Siri or Google Translate instead take the approach of comparing the input against *massive* databases of known samples/translations in order to extract the meaning, mostly without even trying to understand the syntax.

  3. I think this would be really hard to do in English because so many words can be both nouns and verbs without any change of form.