Stealing Commas: Does this exist? (Text Analysis and Manipulation Software)

Wednesday, February 1, 2012

I've been thinking about doing something extremely silly for a while now, but doing it would require the use of tools that might not actually exist: specifically, computer programs.

I know almost nothing of programming. I know enough to know that I don't know enough. Oftentimes I've thought something would be easy only to be informed that it would be impossible, or thought something would be impossible only to be informed that not only was it incredibly easy, it was already done. So I'm just going to ask. Here are some programs I'd like to play with. Do they exist?

The first is a search and replace function that takes into account morphology and syntax.

Say one wanted to replace every instance of “have” in a document with an instance of “possess”. The first hurdle would be that many of the cases aren't going to be “have”. They're going to be “had” or “having”. You won't find them with a simple search for “have”, and you won't be able to replace them with the word “possess”. You need to be able to find instances of a word in forms other than the one given, and replace them with matching forms of the other word.

Second, “have” isn't always used the same way. You don't want, “I have been there many times,” to become “I possess been there many times.” That's where the syntax would come in. An auxiliary verb is different from a main verb, and so a distinction can be made. (For that matter “have to” and “have” could, in theory, be automatically distinguished since ordinary “have” is never going to be followed by “to” and “have to” is, I'm pretty sure, never split.)
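A minimal sketch of the two hurdles above, in Python. Everything here is an illustrative assumption rather than an existing tool: the `FORMS` table is written by hand, and the `PARTICIPLES` list is a crude stand-in for real syntactic analysis. It only shows the shape of the idea: match every form of the word, replace it with the matching form, and leave auxiliary “have” and “have to” alone.

```python
import re

# Hypothetical hand-written table: each form of "have" and the matching form of "possess".
FORMS = {"have": "possess", "has": "possesses", "had": "possessed", "having": "possessing"}

# Toy stand-in for syntax: a few past participles that signal auxiliary "have".
PARTICIPLES = {"been", "gone", "seen", "done", "had"}

def replace_have(text):
    # Split into alternating runs of word characters and non-word characters.
    tokens = re.findall(r"\w+|\W+", text)
    out = []
    for i, tok in enumerate(tokens):
        new = FORMS.get(tok.lower())
        if new is None:
            out.append(tok)
            continue
        # The next word sits two tokens ahead (the token in between is spacing/punctuation).
        nxt = tokens[i + 2].lower() if i + 2 < len(tokens) else ""
        if nxt in PARTICIPLES or nxt == "to":
            out.append(tok)  # "have been ..." or "have to ...": leave untouched
        else:
            out.append(new.capitalize() if tok[0].isupper() else new)
    return "".join(out)

print(replace_have("They have land."))                # They possess land.
print(replace_have("I have been there many times."))  # unchanged
```

A real version would need a full list of irregular forms and an actual part-of-speech tagger in place of the participle list; the mistakes this toy makes are exactly the kind described above.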

Or, to switch words for a moment, you don't want to confuse, “This is my land,” with, “I'm going to land the plane.”

I know that based on how something is used in a sentence computers can give a best guess at what part of speech it is. I have no idea how these programs work or how well their guesses hold up, but I know that something like that exists. What I don't know is whether they've been implemented in a search program where you could look up, say, all of the verb instances of the word “land.”
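To make the “best guess from how the word is used” idea concrete, here is a deliberately tiny sketch. It is not how real taggers work (those are trained on large tagged collections of text); the cue lists `NOUN_CUES` and `VERB_CUES` are assumptions invented for this example, and the guess looks only at the single word before the target.

```python
import re

# Assumed cue words: determiners and possessives usually precede a noun,
# while "to", modals, and subject pronouns often precede a verb.
NOUN_CUES = {"the", "a", "this", "that", "my", "your", "his", "her", "our", "their"}
VERB_CUES = {"to", "will", "can", "must", "i", "we", "you", "they"}

def guess_pos(sentence, word):
    """Return one guess ('noun', 'verb', or 'unknown') per occurrence of word."""
    words = re.findall(r"[a-z']+", sentence.lower())
    guesses = []
    for i, w in enumerate(words):
        if w != word:
            continue
        prev = words[i - 1] if i > 0 else ""
        if prev in NOUN_CUES:
            guesses.append("noun")
        elif prev in VERB_CUES:
            guesses.append("verb")
        else:
            guesses.append("unknown")
    return guesses

print(guess_pos("This is my land,", "land"))             # ['noun']
print(guess_pos("I'm going to land the plane.", "land"))  # ['verb']
```

Searching for “all of the verb instances of land” would then just mean keeping the occurrences whose guess is `verb`.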

At some point it's going to break down because there will be things that can't be divided up by morphology and syntax, but if enough of it works right the things that work wrong should be part of the fun. The thing is, I don't know if any of this exists.

Is there a search function that takes into account morphology and syntax? Is there a replace function to go with it?

Also, if there is, could I, say, use it to get a list of the transitive verbs in a document ranked from most frequent to least frequent? Intransitive? Verbs that go both ways? Nouns? Adjectives? Adverbs? Stuff? That might be an interesting thing to have.
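Assuming some tagger has already labeled each word with a part of speech, the ranking itself is the easy part. The sketch below uses a hand-tagged list, `TAGGED`, as a stand-in for tagger output; the tag names are made up for the example. Splitting verbs into transitive and intransitive would need more than tags, since it depends on whether each verb has an object in the sentence.

```python
from collections import Counter, defaultdict

# Hand-tagged stand-in for the output of a part-of-speech tagger.
TAGGED = [
    ("The", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("chased", "VERB"),
    ("the", "DET"), ("cat", "NOUN"), ("and", "CONJ"), ("the", "DET"),
    ("dog", "NOUN"), ("barked", "VERB"), ("loudly", "ADV"),
]

def rank_by_tag(tagged):
    """Map each tag to its words, ranked from most to least frequent."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[tag][word.lower()] += 1
    return {tag: counter.most_common() for tag, counter in counts.items()}

ranked = rank_by_tag(TAGGED)
print(ranked["NOUN"])  # [('dog', 2), ('cat', 1)]
```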

Mostly changing direction, since I'm already talking about running analysis on a document using a program that has some way of working at syntax, would it be possible to, say, get a list of all the adjectives used to modify a given noun? (Preferably ranked from most frequent to least frequent.) Verbs with which it's the subject? Verbs with which it's the object? (Would probably want to subdivide verbs into active and passive.)
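For the adjective question, a rough approximation that needs only tagged words (again hand-tagged here, as an assumption) is to collect the run of adjectives sitting directly in front of each occurrence of the noun. It misses anything like “the dog is big”, which would need a real parse, and the subject and object questions would need one too.

```python
from collections import Counter

# Hand-tagged stand-in for tagger output.
TAGGED = [
    ("the", "DET"), ("big", "ADJ"), ("red", "ADJ"), ("dog", "NOUN"),
    ("saw", "VERB"), ("a", "DET"), ("small", "ADJ"), ("cat", "NOUN"),
    ("and", "CONJ"), ("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"),
    ("ran", "VERB"),
]

def adjectives_for(tagged, noun):
    """Rank the adjectives that appear immediately before the given noun."""
    counts = Counter()
    for i, (word, tag) in enumerate(tagged):
        if tag != "NOUN" or word.lower() != noun:
            continue
        j = i - 1
        # Walk backwards over the unbroken run of adjectives before the noun.
        while j >= 0 and tagged[j][1] == "ADJ":
            counts[tagged[j][0].lower()] += 1
            j -= 1
    return counts.most_common()

print(adjectives_for(TAGGED, "dog"))  # [('big', 2), ('red', 1)]
```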

Is text parsing at a level where it can do a less than crap job of figuring out which noun a pronoun is associated with and use that as part of the gathering above?

If I did gather this kind of data, by whatever means, for multiple subjects is there something I could use to run an analysis on the multiple data sets looking for patterns? (Elements that often are used together, chords of text basically.) So that after comparing all the data one could have a set of patterns and be able to say, for example, “Set X is associated with patterns 1, 3, and 7”?

Perhaps I should be a bit less abstract.

Say I wrote by leaning heavily on stereotypes and as a result I used the same language to describe all of my male characters. I may not use every word in that with every male character, but it should theoretically be possible to discern a pattern because the words tend to be used together in the male characters, and tend not to be used in the female ones. More than that, it should be possible to discover that pattern even if you don't know what you're looking for.

Since the language of maleness is being used on some characters (male ones) but not others (female ones) looking for words that tended to occur together should result in you finding that language and being able to classify characters into the category that is described using it, and the category that is not. When you looked at which characters fell into which categories you'd probably realize that it had to do with gender, but the process itself should be able to be done with no such thinking.

Of course, that wouldn't be the only thing. Maybe all evil characters are described using a set of stereotypical adjectives and verbs as well. Perhaps the same goes for leaders. A character might be an evil male leader. They might be the only evil male leader, in which case the entire set of words associated with them should be fairly unique, but within it one would be able to find words from three separate patterns (male, evil, leader.) A female good leader should be recognizable as being in one, and only one, of the same patterns as the male evil leader.
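Here is an idealized sketch of that male/evil/leader example. The character names and word lists are invented, and the grouping rule (words used on exactly the same set of characters form one pattern) only works because the toy data has no noise; real text would need something statistical, such as clustering, instead of exact matching.

```python
from collections import defaultdict

# Invented data: the words used to describe each character.
DESCRIPTIONS = {
    "Aldric": {"gruff", "stoic", "cruel", "scheming", "commanding"},  # evil male leader
    "Bram": {"gruff", "stoic", "kind"},
    "Cora": {"graceful", "gentle", "commanding", "kind"},             # good female leader
    "Dessa": {"graceful", "gentle", "cruel", "scheming"},
}

def find_patterns(descriptions):
    """Group words that are used on exactly the same set of characters."""
    users = defaultdict(set)
    for name, words in descriptions.items():
        for word in words:
            users[word].add(name)
    patterns = defaultdict(set)
    for word, names in users.items():
        patterns[frozenset(names)].add(word)
    return dict(patterns)

def shared_patterns(patterns, a, b):
    """List the patterns (as sorted word lists) that both characters fall into."""
    return [sorted(words) for names, words in patterns.items() if a in names and b in names]

patterns = find_patterns(DESCRIPTIONS)
print(shared_patterns(patterns, "Aldric", "Cora"))  # [['commanding']]
```

No notion of gender, morality, or leadership goes in; the groups come out of which words travel together, and the two leaders share exactly one of them.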

In reality it wouldn't be nearly that clear cut or that easy to detect. There would be a lot of noise to deal with, not to mention the problems of small sample size and doubtless many headaches. But as I mentioned before, I'm thinking of doing silly things, so I'm not actually wondering if anything exists that does this sort of thing well. I'm wondering if something exists that does this sort of thing at all.

And as with the search and replace, mistakes might be part of the fun, but only if it works right to some degree.

Now I started with nouns, and then moved into characters for a specific example, but I'd be interested in whether something similar could be done with other parts of speech. What objects does this verb take? What are some other verbs that take a similar set of objects? Ditto for subjects. What nouns does this adjective modify? What other adjectives modify a similar set of nouns?
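If the verb-object pairs had already been gathered by some parser, “verbs that take a similar set of objects” could be scored with a simple set overlap. The sketch below uses Jaccard similarity (shared objects divided by all objects either verb takes), which is one reasonable choice among several; the `OBJECTS` table is invented for the example.

```python
# Invented data: the objects each verb was seen taking.
OBJECTS = {
    "eat": {"bread", "apple", "soup"},
    "devour": {"bread", "apple", "book"},
    "read": {"book", "letter"},
}

def jaccard(a, b):
    """Size of the overlap divided by the size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_verbs(objects_by_verb, verb):
    """Rank the other verbs by how similar their object sets are to this verb's."""
    target = objects_by_verb[verb]
    scores = [
        (other, jaccard(target, objs))
        for other, objs in objects_by_verb.items()
        if other != verb
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(similar_verbs(OBJECTS, "eat"))  # [('devour', 0.5), ('read', 0.0)]
```

The same function works unchanged for subjects, or for adjectives and the nouns they modify, given the corresponding table.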

That sort of thing.

So, I've described multiple programs here. Do any of them exist?

Thanks.

9 comments:

Evan, February 1, 2012 at 8:48 PM
Nondisclaimer: I'm a computer programmer.

I'm sure the sort of program you describe exists, because I've seen Google search suggestions and results that can best be explained using it. (E.g., by your example, I search for "have" and get results for "possess" as well.) However, it'd be very hard to code and would be impossible to get perfect. You'd need to teach it human speech. For the adjective-modifying-noun question, you'd need to show it all the ways we arrange words and how the different arrangements can be diagrammed. The synonym-in-context question would be even worse: you'd not only need to give it a thesaurus, but you'd need to teach it how to infer from context which of several meanings a given word means in this particular instance.

I see a couple ways I could start doing this, but I'd need funding from a big company - like Google - to get it reasonably good. Of course, Google isn't sharing their code. Perhaps a lot of people could get together in an ad-hoc network over the web to do it, but I've no idea whether it's already happened.
Redwood, February 1, 2012 at 11:23 PM
Google doesn't actually do this - their algorithms notice that a lot of phrases using the word "have" also appear elsewhere with the word "possess" in place of "have" - but it's really just pattern-matching against a *huge* database. It's the same strategy they use for translation - match pieces of the text against a large database of translations.

Unfortunately, what you want is *really* hard in computer terms. The computer basically has to understand natural English. It's been the subject of intensive research for at least 30 years, and it's nowhere close to being solved. There are probably some experimental programs in AI research labs at places like MIT which could do a very half-assed version of this (by which I mean they get things right maybe 60-70% of the time), but nothing that's commercially available.
chris the cynic, February 2, 2012 at 6:49 AM
“The synonym-in-context question would be even worse: you'd not only need to give it a thesaurus, but you'd need to teach it how to infer from context which of several meanings a given word means in this particular instance.”

I wasn't actually thinking primarily of synonyms, I just used have-possess as an example because it's something a reasonable person might want to do and thus seemed like a better example than, say, die and sit. Though theoretically, as intransitive verbs, you should be able to swap the two out and get grammatical sentences, even if they aren't sensible ones.

What you can't do is swap across parts of speech, and have does a good job of showing why because sometimes it's a verb, and sometimes it's a tense marker*, which are two very different things.

Land illustrates the same thing. It doesn't matter whether you replace the noun with something that kind of sort of makes sense, like "terrain", or you replace it with something that makes no sense, like "demon duck of doom", either way the grammar will still make sense after the replacement. What you can't do without destroying the grammatical sense is to replace the verb "to land" with a noun (or noun phrase as in the case of "demon duck of doom".)

That's what I was talking about, not a computer understanding which definition is used. I could see a computer working for part of speech**, I could see it working for morphology***, I very much doubt we'll have a computer selecting the correct definition of a word when there isn't an obvious indicator any time in the near future.

I may be off on what computers are capable of doing, but I'm not quite that far off.

“Google doesn't actually do this - their algorithms notice that a lot of phrases using the word 'have' also appear elsewhere with the word 'possess' in place of 'have' - but it's really just pattern-matching against a *huge* database.”

That's how I would do what Google does if I were in their place. It seems the natural solution. Which makes me somewhat surprised that Google does it. Usually what seems natural to me is not what seems natural to others.

-

* Ok, technically it's a verbal auxiliary used to indicate perfect aspect, but "tense marker" fits into a sentence more nicely.

** I've seen it used to translate English into a phonetic alphabet. Something like "live" is spoken differently, and thus has a different phonetic spelling, depending on whether it is a verb or an adjective so syntax is used as a part of the automated translation process.

No idea how it works or what the rate of success is though.

*** I've seen it done with some success in Latin, where things are, I think, both easier and harder.
cjmr, February 5, 2012 at 6:00 PM
cjmr's husband says, "Yeah, we were working on similar software to that [in the early 90s] but that wasn't what we were planning to use it for."

So the answer may very well be, yes, there is software that does that, but it's classified.
Kellandros, February 6, 2012 at 3:09 PM
Another programmer chiming in. I had a college course covering the theory of natural language processing (i.e. understanding the syntax of a normal language, not one constrained to a specific system).

Google has access to huge masses of text in various languages. It relies on pattern recognition rather than understanding the meaning of a given word. Most prior theory required building a giant database of every possible meaning of a word, and then training a program on prepared documents with everything properly tagged as to which meaning was intended. From that you could start the program on any block of text with a moderate degree of success.

A brute force sort of way would be to return a database table of every word used in the input text, tagged with the 1 to 3 words preceding it. You could then find some sort of dictionary file or service (like those used with spelling and grammar checking programs) to filter down to the nouns as the primary words. You could then try filtering out non-adjectives from the attached words.
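The brute force approach might look something like the sketch below. The `NOUNS` and `ADJECTIVES` sets are tiny made-up stand-ins for the dictionary file or service, and a plain list of rows stands in for the database table.

```python
import re
from collections import defaultdict

# Made-up stand-ins for a real dictionary lookup.
NOUNS = {"dog", "cat"}
ADJECTIVES = {"big", "red", "small"}

def context_table(text, max_back=3):
    """One row per word: (word, tuple of up to max_back preceding words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [(w, tuple(words[max(0, i - max_back):i])) for i, w in enumerate(words)]

def adjectives_near_nouns(rows):
    """Keep only noun rows, and only the adjectives among their preceding words."""
    found = defaultdict(list)
    for word, preceding in rows:
        if word in NOUNS:
            found[word].extend(p for p in preceding if p in ADJECTIVES)
    return dict(found)

rows = context_table("The big red dog barked at a small cat")
print(rows[3])                      # ('dog', ('the', 'big', 'red'))
print(adjectives_near_nouns(rows))  # {'dog': ['big', 'red'], 'cat': ['small']}
```

Because it ignores sentence structure, this will happily attach an adjective to the wrong noun when two nouns sit within three words of each other.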

I think dictionary.com or a similar page may offer a web service (a defined protocol to feed it input and get results back that is not tied to a specific programming language). It would require research, free time, and hopefully not overloading a third party site to the point they disconnect the web service.
Kellandros, February 6, 2012 at 6:11 PM
I don't think you would have hope for pronoun matching to original noun; there are too many ways a bad writer can mess things up.

Just a question though; what would be the main use for these hypothetical tools/programs? That really alters how something should be designed.

To test/improve your own writing?
- Best bet would be to start with existing grammar checkers, and then see what extra tests to create that aren't covered.

To analyze large numbers of documents (e.g. term paper grading or critical analysis of published documents)?
- Focus on building statistics. Don't worry about picking out the correct meaning of every word. More important to point out repetitive phrases, repeated mistakes, or unclear passages.

If you have a situation with a large number of special cases, it is often easier to just program something to summarize the data and let the user make a decision.
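As one example of the “build statistics and let the user decide” approach, the sketch below counts repeated three-word phrases and simply reports them. The function name and the thresholds (`n=3`, `min_count=2`) are arbitrary choices for illustration.

```python
import re
from collections import Counter

def repeated_phrases(text, n=3, min_count=2):
    """Return (phrase, count) pairs for every n-word phrase seen at least min_count times."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(gram, count) for gram, count in grams.most_common() if count >= min_count]

sample = "At the end of the day we left. At the end of the night we slept."
for phrase, count in repeated_phrases(sample):
    print(count, phrase)
```

The program makes no judgment about whether the repetition is a problem; it only summarizes, and the writer decides.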