Wednesday, February 1, 2012

Does this exist? (Text Analysis and Manipulation Software)

I've been thinking about doing something extremely silly for a while now but doing it would require the use of tools that might not actually exist. Specifically computer programs.

I know almost nothing of programming. I know enough to know that I don't know enough. Oftentimes I've thought something would be easy only to be informed that it would be impossible, or thought something would be impossible only to be informed that not only was it incredibly easy, it was already done. So I'm just going to ask. Here are some programs I'd like to play with. Do they exist?

The first is a search and replace function that takes into account morphology and syntax.

Say one wanted to replace every instance of “have” in a document with an instance of “possess”. The first hurdle would be that many of the cases aren't going to be “have”. They're going to be “had”, or “having”. You won't find them with a simple search for “have”, and you won't be able to replace them with the word “possess”. You need to be able to find instances of a word outside of the form given, and replace them with forms of the other word that match.

Second, “have” isn't always used the same way. You don't want, “I have been there many times,” to become “I possess been there many times.” That's where the syntax would come in. A verb auxiliary is different from a verb, and so a distinction can be made. (For that matter “have to” and “have” could, in theory, be automatically distinguished since ordinary “have” is never going to be followed by “to” and “have to” is, I'm pretty sure, never split.)

Or, to switch words for a moment, you don't want to confuse, “This is my land,” with, “I'm going to land the plane.”

I know that based on how something is used in a sentence computers can give a best guess at what part of speech it is. I have no idea how these programs work or how well their guesses hold up, but I know that something like that exists. What I don't know is whether they've been implemented in a search program where you could look up, say, all of the verb instances of the word “land.”

At some point it's going to break down because there will be things that can't be divided up by morphology and syntax, but if enough of it works right the things that work wrong should be part of the fun. The thing is, I don't know if any of this exists.

Is there a search function that takes into account morphology and syntax? Is there a replace function to go with it?
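
To make the question concrete, here is a minimal sketch of what such a search and replace might look like for the have/possess case. Everything in it is an assumption made for illustration: the inflection table is written by hand, and the "syntax" is only a crude guess based on the next word (a following "to", or something that looks like a past participle, is taken to mean "have" is not an ordinary verb). A real tool would need an actual morphological analyzer and part of speech tagger.

```python
import re

# Hand-written inflection table (an assumption for illustration; a real tool
# would use a morphological analyzer rather than a fixed dictionary).
FORMS = {
    "have": "possess",
    "has": "possesses",
    "had": "possessed",
    "having": "possessing",
}

# Crude stand-in for syntax: a few irregular past participles, plus any
# word ending in "ed", are taken to mean "have" is acting as an auxiliary.
PARTICIPLES = {"been", "done", "gone", "seen", "taken", "given", "had"}


def looks_like_auxiliary(next_word):
    """Guess whether the word after a form of 'have' makes it an auxiliary."""
    w = next_word.lower()
    return w == "to" or w in PARTICIPLES or w.endswith("ed")


def replace_have(text):
    """Replace main-verb forms of 'have' with matching forms of 'possess'."""
    tokens = re.findall(r"\w+|\W+", text)  # keep words and separators alike
    out = []
    for i, tok in enumerate(tokens):
        replacement = FORMS.get(tok.lower())
        if replacement is None:
            out.append(tok)
            continue
        # Find the next word token, skipping whitespace and punctuation.
        next_word = next((t for t in tokens[i + 1:] if t[0].isalnum()), "")
        if looks_like_auxiliary(next_word):
            out.append(tok)
        else:
            # Preserve an initial capital letter.
            out.append(replacement.capitalize() if tok[0].isupper() else replacement)
    return "".join(out)
```

With this, "I have a dog." becomes "I possess a dog." while "I have been there many times." is left alone. It also goes wrong in exactly the ways one would expect: "I have red shoes" is skipped because "red" happens to end in "ed".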

Also, if there is, could I, say, use it to get a list of the transitive verbs in a document ranked from most frequent to least frequent? Intransitive? Verbs that go both ways? Nouns? Adjectives? Adverbs? Stuff? That might be an interesting thing to have.


Mostly changing direction, since I'm already talking about running analysis on a document using a program that has some way of working at syntax, would it be possible to, say, get a list of all the adjectives used to modify a given noun? (Preferably ranked from most frequent to least frequent.) Verbs with which it's the subject? Verbs with which it's the object? (Would probably want to subdivide verbs into active and passive.)

Is text parsing at a level where it can do a less than crap job of figuring out which noun a pronoun is associated with and use that as part of the gathering above?

If I did gather this kind of data, by whatever means, for multiple subjects is there something I could use to run an analysis on the multiple data sets looking for patterns? (Elements that often are used together, chords of text basically.) So that after comparing all the data one could have a set of patterns and be able to say, for example, “Set X is associated with patterns 1, 3, and 7”?

Perhaps I should be a bit less abstract.

Say I wrote by leaning heavily on stereotypes and as a result I used the same language to describe all of my male characters. I may not use every word in that with every male character, but it should theoretically be possible to discern a pattern because the words tend to be used together in the male characters, and tend not to be used in the female ones. More than that, it should be possible to discover that pattern even if you don't know what you're looking for.

Since the language of maleness is being used on some characters (male ones) but not others (female ones) looking for words that tended to occur together should result in you finding that language and being able to classify characters into the category that is described using it, and the category that is not. When you looked at which characters fell into which categories you'd probably realize that it had to do with gender, but the process itself should be able to be done with no such thinking.

Of course, that wouldn't be the only thing. Maybe all evil characters are described using a set of stereotypical adjectives and verbs as well. Perhaps the same goes for leaders. A character might be an evil male leader. They might be the only evil male leader, in which case the entire set of words associated with them should be fairly unique, but within it one would be able to find words from three separate patterns (male, evil, leader.) A female good leader should be recognizable as being in one, and only one, of the same patterns as the male evil leader.

In reality it wouldn't be nearly that clear cut or that easy to detect. There would be a lot of noise to deal with, not to mention the problems of small sample size and doubtless many headaches. But as I mentioned before, I'm thinking of doing silly things, so I'm not actually wondering if anything exists that does this sort of thing well. I'm wondering if something exists that does this sort of thing at all.

And as with the search and replace, mistakes might be part of the fun, but only if it works right to some degree.
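
As a rough illustration of the pattern-finding idea, here is a minimal sketch that counts which descriptive words turn up together across characters. The character names and word sets are invented for the example, and counting shared pairs is only one simple way to look for "chords"; it is not a claim about how any existing tool does it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: words gathered for each character by whatever means.
descriptions = {
    "Aldo": {"strong", "gruff", "cruel", "scheming"},
    "Bram": {"strong", "gruff", "kind"},
    "Cara": {"graceful", "gentle", "cruel", "scheming"},
    "Dana": {"graceful", "gentle", "kind"},
}


def cooccurring_pairs(data, min_count=2):
    """Count how many subjects share each pair of words; keep frequent pairs."""
    counts = Counter()
    for words in data.values():
        # Sorting makes each pair come out in one consistent order.
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


def subjects_with(data, pair):
    """List the subjects whose descriptions contain both words of a pair."""
    return sorted(name for name, words in data.items() if set(pair) <= words)
```

On this toy data three pairs survive: gruff/strong, gentle/graceful, and cruel/scheming. Asking `subjects_with(descriptions, ("cruel", "scheming"))` then gives Aldo and Cara, a grouping that cuts across the other two patterns without anyone having told the program what "evil" means.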


Now I started with nouns, and then moved into characters for a specific example, but I'd be interested in whether something similar could be done with other parts of speech. What objects does this verb take? What are some other verbs that take a similar set of objects? Ditto for subjects. What nouns does this adjective modify? What other adjectives modify a similar set of nouns?

That sort of thing.
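
For the "similar set of objects" question, one simple measure would be set overlap. The sketch below assumes the verb and object data has already been gathered somehow; the verbs and objects are made up, and the Jaccard measure is just a common choice for comparing sets, not necessarily the best one.

```python
# Hypothetical data: the objects each verb was seen taking in some text.
verb_objects = {
    "eat": {"bread", "apple", "soup"},
    "devour": {"bread", "apple", "book"},
    "read": {"book", "letter"},
}


def jaccard(a, b):
    """Overlap of two sets: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(target, table):
    """Rank the other keys by how closely their sets match the target's."""
    scores = [(jaccard(table[target], items), other)
              for other, items in table.items() if other != target]
    return [other for score, other in sorted(scores, reverse=True)]
```

Here `most_similar("eat", verb_objects)` ranks "devour" ahead of "read". The same two functions would work unchanged for adjectives and the nouns they modify.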


So, I've described multiple programs here. Do any of them exist?



  1. Nondisclaimer: I'm a computer programmer.

    I'm sure the sort of program you describe exists, because I've seen Google search suggestions and results that can best be explained using it. (E.g., by your example, I search for "have" and get results for "possess" as well.) However, it'd be very hard to code and would be impossible to get perfect. You'd need to teach it human speech. For the adjective-modifying-noun question, you'd need to show it all the ways we arrange words and how the different arrangements can be diagrammed. The synonym-in-context question would be even worse: you'd not only need to give it a thesaurus, but you'd need to teach it how to infer from context which of several meanings a given word means in this particular instance.

    I see a couple ways I could start doing this, but I'd need funding from a big company - like Google - to get it reasonably good. Of course, Google isn't sharing their code. Perhaps a lot of people could get together in an ad-hoc network over the web to do it, but I've no idea whether it's already happened.

  2. Google doesn't actually do this - their algorithms notice that a lot of phrases using the word "have" also appear elsewhere with the word "possess" in place of "have" - but it's really just pattern-matching against a *huge* database. It's the same strategy they use for translation - match pieces of the text against a large database of translations.

    Unfortunately, what you want is *really* hard in computer terms. The computer basically has to understand natural English. It's been the subject of intensive research for at least 30 years, and it's nowhere close to being solved. There are probably some experimental programs in AI research labs at places like MIT which could do a very half-assed version of this (by which I mean they get things right maybe 60-70% of the time), but nothing that's commercially available.

    1. What Redwood said. "Natural language parsing" is the key phrase; while it's possible to get results that look good enough in the short term (as with Google suggestions), the sort of textual analysis tool you're looking for doesn't exist at a usable level of quality.

      There are some worthwhile textual analyses - Brian Vickers recently brought out a book on his analyses of Shakespeare's plays with a view to detailed considerations of authorship - but those are looking for word frequencies and phrase patterns.

      (If you could get people to write in lojban it would probably be much easier. This is a joke.)

  3. The synonym-in-context question would be even worse: you'd not only need to give it a thesaurus, but you'd need to teach it how to infer from context which of several meanings a given word means in this particular instance.

    I wasn't actually thinking primarily of synonyms, I just used have-possess as an example because it's something a reasonable person might want to do and thus seemed like a better example than, say, die and sit. Though theoretically, as intransitive verbs, you should be able to swap the two out and get grammatical sentences, even if they aren't sensible ones.

    What you can't do is swap across parts of speech, and have does a good job of showing why because sometimes it's a verb, and sometimes it's a tense marker*, which are two very different things.

    Land illustrates the same thing. It doesn't matter whether you replace the noun with something that kind of sort of makes sense, like "terrain", or you replace it with something that makes no sense, like "demon duck of doom", either way the grammar will still make sense after the replacement. What you can't do without destroying the grammatical sense is to replace the verb "to land" with a noun (or noun phrase as in the case of "demon duck of doom".)

    That's what I was talking about, not a computer understanding which definition is used. I could see a computer working for part of speech**, and I could see it working for morphology***, but I very much doubt we'll have a computer selecting the correct definition of a word when there isn't an obvious indicator any time in the near future.

    I may be off on what computers are capable of doing, but I'm not quite that far off.

    Google doesn't actually do this - their algorithms notice that a lot of phrases using the word "have" also appear elsewhere with the word "possess" in place of "have" - but it's really just pattern-matching against a *huge* database.

    That's how I would do what Google does if I were in their place. It seems the natural solution. Which makes me somewhat surprised that Google does it. Usually what seems natural to me is not what seems natural to others.


    * Ok, technically it's a verbal auxiliary used to indicate perfect aspect, but "tense marker" fits into a sentence more nicely.

    ** I've seen it used to translate English into a phonetic alphabet. Something like "live" is spoken differently, and thus has a different phonetic spelling, depending on whether it is a verb or an adjective so syntax is used as a part of the automated translation process.

    No idea how it works or what the rate of success is though.

    *** I've seen it done with some success in Latin, where things are, I think, both easier and harder.

  4. cjmr's husband says, "Yeah, we were working on similar software to that [in the early 90s] but that wasn't what we were planning to use it for."

    So the answer may very well be, yes, there is software that does that, but it's classified.

  5. Another programmer chiming in. I had a college course covering the theory of natural language processing (i.e. understanding the syntax of a normal language, not one constrained to a specific system).

    Google has access to huge masses of text in various languages. It relies on pattern recognition rather than on understanding the meaning of a given word. Most prior theory required building a giant database of every possible meaning of a word, and then training a program on prepared documents in which every word was tagged with its meaning. From that you could start the program on any block of text with a moderate degree of success.

    A brute force sort of way would be to return a database table of every word used in the input text, tagged with the 1 to 3 words preceding it. You could then find some sort of dictionary file/service (like those used with spelling and grammar checking programs) to filter down to the nouns as the primary words. You could then try filtering out non-adjectives from the attached words.
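
    A minimal sketch of that brute force approach might look like the following. The two word sets stand in for the dictionary file or service and are made up for the example, as are the function names; nothing here understands grammar, it only looks at which words sit nearby.

```python
import re
from collections import Counter

# Stand-ins for the dictionary file or service mentioned above; both word
# lists are assumptions made up for this example.
NOUNS = {"dog", "house", "cat"}
ADJECTIVES = {"old", "big", "red", "lazy"}


def preceding_words(text, max_back=3):
    """Yield (word, preceding words) rows for every word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    for i, word in enumerate(words):
        yield word, words[max(0, i - max_back):i]


def adjective_table(text):
    """Map each dictionary noun to counts of adjectives found just before it."""
    table = {}
    for word, before in preceding_words(text):
        if word in NOUNS:
            counts = table.setdefault(word, Counter())
            counts.update(w for w in before if w in ADJECTIVES)
    return table
```

    Calling `adjective_table(text)["dog"].most_common()` then gives the adjectives for "dog" ranked from most to least frequent. Note that the 3-word window happily reaches back across sentence boundaries, which is one of the ways this will get things wrong.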

    I think or a similar page may offer a web service (a defined protocol to feed it input and get results back that is not tied to a specific programming language). It would require research, free time, and hopefully not overloading a third party site to the point they disconnect the web service.

  6. I don't think there's much hope for matching pronouns to their original nouns; there are too many ways a bad writer can mess things up.

    Just a question though; what would be the main use for these hypothetical tools/programs? That really alters how something should be designed.

    To test/improve your own writing?
    - Best bet would be to start with existing grammar checkers, and then see what extra tests to create that aren't covered.

    To analyze large numbers of documents (i.e. term paper grading or critical analysis of published documents)?
    - Focus on building statistics. Don't worry about picking out the correct meaning of every word. More important to point out repetitive phrases, repeated mistakes, or unclear passages.

    If you have a situation with a large number of special cases, it is often easier to just program something to summarize the data and let the user make a decision.

    1. I was guessing that even if other parts were possible pronouns would be extremely problematic, not just because of bad writers introducing ambiguity unintentionally but also because good writers sometimes do the same for poetic effect. Though the latter happens much more rarely. Especially since my primary interest is in prose where poetic effect is used much more rarely than it would be in something like, say, poetry.

      As for how I was thinking of using them, the word replacement would be for entirely silly things. For the rest I was thinking more as analysis of existing works.

      If you're looking at how an author presents a character, for example, it is not unusual to look through the words used to describe that character. (Go through the book, take a tally of every word used to describe the character, see what emerges.) Which is all well and good if you're doing it for the purposes of serious business, but the time sink is pretty massive if you're doing it for personal interest. Especially if you'd like to compare that to how said author describes other characters across several books (or, say, to compare how several different authors treat the same character.)

      The up side is that if you're not doing it for academia but instead for yourself you can allow for a much lower standard, which means that if a streamlining process gets some things wrong it doesn't matter so much. That would be an acceptable loss if it reduced the amount of time one had to put into the endeavor.

      Especially given how much a faster process could potentially be used for. If you want to compare the presentation of protagonists in books by author A as compared to books by author B doing that by hand means reading every book by author A, noting all of the words used to describe the protagonist in each, reading every book by author B, noting all of the words used to describe the protagonist in each. And you've just read a bunch of books, possibly dozens, in one of the most unpleasant ways possible. (Stopping whenever the protagonist has a description to add that to a running list isn't exactly going to let you enjoy the story.)

      Whereas, if you could automate the process then you'd only have to identify the protagonist of each and let the program run to get the data. It would be significantly less daunting. Which might also lead to doing the same in places you never would have considered doing it before. Look at how the automation of word frequency checks, as in word clouds, has expanded the use of that to things people would have never previously considered using them on.

    2. My best design approach is to focus on gathering and organizing information first. Databases are wonderful for this. Once you have a table of information, you can sort, filter, and rearrange endlessly.

      Hmm, table design:
      id (unique identifier)
      word (the actual current word)
      word position (first word on a given page is 1, count from there)

      Unsure on labeling by paragraph or sentence; those would require more work to identify depending on style. Conversations between characters are sometimes placed so each sentence is its own paragraph. Sentences would require adding rules to ignore abbreviations, periods in numbers, unusual punctuation (!?!?), etc.

      With that, you could easily get statistics (overall, by chapter, by page) of word frequency. With a little work you could find all words that precede words in a given list (to try to find adjectives).
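
      As a rough sketch of that table and the two queries, using Python's built-in sqlite3 module. The function names are invented for the example, and as a simplification the position counts from the start of the whole text rather than restarting on each page.

```python
import re
import sqlite3


def load_words(text):
    """Store each word of the text with its position in an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE words ("
        "id INTEGER PRIMARY KEY, word TEXT NOT NULL, position INTEGER NOT NULL)"
    )
    words = re.findall(r"[a-z']+", text.lower())
    conn.executemany(
        "INSERT INTO words (word, position) VALUES (?, ?)",
        [(w, i) for i, w in enumerate(words, start=1)],
    )
    return conn


def word_frequencies(conn):
    """Return (word, count) rows, most frequent first."""
    return conn.execute(
        "SELECT word, COUNT(*) AS n FROM words "
        "GROUP BY word ORDER BY n DESC, word ASC"
    ).fetchall()


def words_before(conn, targets):
    """Count the words that sit immediately before any word in targets."""
    marks = ",".join("?" for _ in targets)
    return conn.execute(
        "SELECT prev.word, COUNT(*) AS n FROM words AS cur "
        "JOIN words AS prev ON prev.position = cur.position - 1 "
        f"WHERE cur.word IN ({marks}) "
        "GROUP BY prev.word ORDER BY n DESC, prev.word ASC",
        list(targets),
    ).fetchall()
```

      Adding chapter or page columns later would only mean extra fields in the same table and an extra GROUP BY in the queries.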

      Word replacement for silly things: see Penny Arcade, Harry Potter, and wand.

      Of course, all of this relies on having the text you receive in a machine readable format. Proprietary file types, DRM, and image-only scans would all get in the way.