I've been thinking about doing something extremely silly for a while now, but doing it would require tools that might not actually exist: specifically, computer programs.
I know almost nothing about programming. I know enough to know that I don't know enough. Often I've thought something would be easy only to be told it was impossible, or thought something would be impossible only to be told that not only was it incredibly easy, it had already been done. So I'm just going to ask. Here are some programs I'd like to play with. Do they exist?
The first is a search and replace function that takes into account morphology and syntax.
Say one wanted to replace every instance of “have” in a document with “possess”. The first hurdle is that many of the instances aren't going to be “have”. They're going to be “had” or “having”. You won't find them with a simple search for “have”, and you can't replace them with the bare word “possess”. You need to be able to find instances of a word in forms other than the one given, and replace them with the matching forms of the other word.
Second, “have” isn't always used the same way. You don't want, “I have been there many times,” to become “I possess been there many times.” That's where the syntax would come in. An auxiliary verb is different from a main verb, so the distinction can be made. (For that matter “have to” and “have” could, in theory, be automatically distinguished, since ordinary “have” is rarely followed directly by “to” and “have to” is, I'm pretty sure, rarely split.)
Or, to switch words for a moment, you don't want to confuse, “This is my land,” with, “I'm going to land the plane.”
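To pin down the behaviour being asked for with “have” and “possess”, here is a toy sketch in Python. It is not an existing tool. The table of forms, the short list of past participles, and the function name `replace_have` are all made up for illustration, and the "syntax" is only a crude look at the next word. A real version would presumably rely on a proper part-of-speech tagger or parser instead.

```python
import re

# Hypothetical hand-written table: each form of "have" mapped to the
# matching form of "possess". A real tool would generate these forms.
FORMS = {
    "have": "possess",
    "has": "possesses",
    "had": "possessed",
    "having": "possessing",
}

# Toy stand-in for syntax: if the next word is one of these past
# participles, treat "have" as an auxiliary and leave it alone.
PARTICIPLES = {"been", "gone", "seen", "done", "eaten", "had"}

WORD = re.compile(r"\b(have|has|had|having)\b", re.IGNORECASE)
NEXT_WORD = re.compile(r"\s+([A-Za-z']+)")


def replace_have(text):
    """Swap main-verb forms of "have" for the matching form of "possess"."""

    def swap(match):
        word = match.group(0)
        # Peek at the word that follows this match in the original text.
        following = NEXT_WORD.match(text, match.end())
        next_word = following.group(1).lower() if following else ""
        # Crude heuristics: skip "have to" and "have" + past participle.
        if next_word == "to" or next_word in PARTICIPLES:
            return word
        new = FORMS[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return WORD.sub(swap, text)


print(replace_have("I have a dog, and I have been there many times."))
```

The interesting failures would come from everything the two heuristics miss, such as a participle that is not in the list or an adverb sitting between the auxiliary and the participle.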
I know that, based on how something is used in a sentence, computers can give a best guess at what part of speech it is. I have no idea how these programs work or how well their guesses hold up, but I know that something like that exists. What I don't know is whether they've been built into a search program where you could look up, say, all of the verb instances of the word “land.”
At some point it's going to break down because there will be things that can't be divided up by morphology and syntax, but if enough of it works right the things that work wrong should be part of the fun. The thing is, I don't know if any of this exists.
Is there a search function that takes into account morphology and syntax? Is there a replace function to go with it?
Also, if there is, could I, say, use it to get a list of the transitive verbs in a document ranked from most frequent to least frequent? Intransitive? Verbs that go both ways? Nouns? Adjectives? Adverbs? Stuff? That might be an interesting thing to have.
Mostly changing direction: since I'm already talking about running analysis on a document using a program that has some way of working with syntax, would it be possible to, say, get a list of all the adjectives used to modify a given noun? (Preferably ranked from most frequent to least frequent.) Verbs it's the subject of? Verbs it's the object of? (I would probably want to subdivide verbs into active and passive.)
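Once something has worked out who modifies what, the counting and ranking part is simple. The sketch below assumes that a parser has already produced (head, relation, dependent) triples. The triples, the relation names "adjective", "subject", and "object", and both function names are invented for this example and do not come from any particular tool.

```python
from collections import Counter

# Hypothetical parser output, already reduced to dictionary forms.
# Each triple reads: (head word, relation, dependent word).
TRIPLES = [
    ("dragon", "adjective", "old"),
    ("dragon", "adjective", "red"),
    ("dragon", "adjective", "old"),
    ("knight", "adjective", "old"),
    ("burn", "subject", "dragon"),
    ("burn", "object", "village"),
    ("slay", "object", "dragon"),
    ("slay", "subject", "knight"),
    ("burn", "subject", "dragon"),
]


def modifiers_of(noun, triples):
    """Adjectives attached to `noun`, most frequent first."""
    counts = Counter(
        dep for head, rel, dep in triples if rel == "adjective" and head == noun
    )
    return counts.most_common()


def verbs_with(noun, role, triples):
    """Verbs that have `noun` as their "subject" or "object", most frequent first."""
    counts = Counter(
        head for head, rel, dep in triples if rel == role and dep == noun
    )
    return counts.most_common()


print(modifiers_of("dragon", TRIPLES))
print(verbs_with("dragon", "subject", TRIPLES))
print(verbs_with("dragon", "object", TRIPLES))
```

The hard part is producing the triples in the first place, which is exactly the piece the question is about.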
Is text parsing at a level where it can do a less than crap job of figuring out which noun a pronoun is associated with and use that as part of the gathering above?
If I did gather this kind of data, by whatever means, for multiple subjects, is there something I could use to run an analysis across the data sets looking for patterns? (Elements that are often used together; chords of text, basically.) So that after comparing all the data one could have a set of patterns and be able to say, for example, “Set X is associated with patterns 1, 3, and 7”?
Perhaps I should be a bit less abstract.
Say I wrote by leaning heavily on stereotypes and, as a result, used the same language to describe all of my male characters. I may not use every word in that set with every male character, but it should theoretically be possible to discern a pattern, because the words tend to be used together for the male characters and tend not to be used for the female ones. More than that, it should be possible to discover that pattern even if you don't know what you're looking for.
Since the language of maleness is being used on some characters (male ones) but not others (female ones), looking for words that tend to occur together should result in you finding that language and being able to classify characters into the category that is described using it and the category that is not. When you looked at which characters fell into which categories you'd probably realize that it had to do with gender, but the process itself should be doable without any such knowledge.
Of course, that wouldn't be the only thing. Maybe all evil characters are described using a set of stereotypical adjectives and verbs as well. Perhaps the same goes for leaders. A character might be an evil male leader. They might be the only evil male leader, in which case the entire set of words associated with them should be fairly unique, but within it one would be able to find words from three separate patterns (male, evil, leader). A good female leader should be recognizable as being in one, and only one, of the same patterns as the evil male leader.
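The idealized version of this can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the characters, their descriptive words, the 0.6 threshold, and the choice of technique (linking two words when the sets of characters they describe overlap heavily, then merging linked words into groups). It is one simple way to look for words that travel together, not an established method for this task.

```python
from itertools import combinations

# Hypothetical data: the words used to describe each character.
DESCRIPTIONS = {
    "Aldric": {"gruff", "burly", "cruel", "sneering", "commanding", "decisive"},
    "Bran": {"gruff", "burly", "kind"},
    "Cora": {"commanding", "decisive", "graceful", "kind"},
    "Dara": {"cruel", "sneering", "graceful"},
    "Edda": {"graceful", "kind"},
}


def jaccard(a, b):
    """Overlap between two sets: shared members divided by all members."""
    return len(a & b) / len(a | b)


def find_patterns(descriptions, threshold=0.6):
    """Group words that are used for largely the same set of characters."""
    # Map each word to the set of characters it is used for.
    used_for = {}
    for character, words in descriptions.items():
        for word in words:
            used_for.setdefault(word, set()).add(character)
    # Start with every word in its own group, then merge the groups of
    # any two words whose character sets overlap enough.
    group_of = {word: {word} for word in used_for}
    for w1, w2 in combinations(sorted(used_for), 2):
        similar = jaccard(used_for[w1], used_for[w2]) >= threshold
        if similar and group_of[w1] is not group_of[w2]:
            merged = group_of[w1] | group_of[w2]
            for word in merged:
                group_of[word] = merged
    # Several words point at the same group, so remove the duplicates.
    unique = {frozenset(group) for group in group_of.values()}
    return sorted(unique, key=sorted)


def patterns_for(character, descriptions, patterns):
    """Patterns that share at least one word with a character's description."""
    return [p for p in patterns if p & descriptions[character]]


patterns = find_patterns(DESCRIPTIONS)
for pattern in patterns:
    print(sorted(pattern))
print([sorted(p) for p in patterns_for("Aldric", DESCRIPTIONS, patterns)])
```

Nothing in the code knows about gender, morality, or leadership; the groups fall out of which words are used for the same characters. With this tidy made-up data, Aldric and Cora end up sharing exactly one pattern, the "commanding" and "decisive" one.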
In reality it wouldn't be nearly that clear cut or that easy to detect. There would be a lot of noise to deal with, not to mention the problems of small sample size and doubtless many headaches. But as I mentioned before, I'm thinking of doing silly things, so I'm not actually wondering if anything exists that does this sort of thing well. I'm wondering if something exists that does this sort of thing at all.
And as with the search and replace, mistakes might be part of the fun, but only if it works right to some degree.
Now, I started with nouns and then moved to characters for a specific example, but I'd be interested in whether something similar could be done with other parts of speech. What objects does this verb take? What are some other verbs that take a similar set of objects? Ditto for subjects. What nouns does this adjective modify? What other adjectives modify a similar set of nouns?
That sort of thing.
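The "other verbs that take a similar set of objects" question comes down to comparing sets. As a rough sketch, assuming the verb-to-objects table has already been gathered somehow (the verbs, objects, and function name below are invented), similarity can be scored as the share of objects two verbs have in common:

```python
# Hypothetical data: the objects each verb was seen taking in a document.
VERB_OBJECTS = {
    "eat": {"bread", "apple", "soup"},
    "devour": {"bread", "apple", "book"},
    "read": {"book", "letter"},
    "write": {"book", "letter", "song"},
}


def similar_verbs(verb, verb_objects):
    """Other verbs ranked by how much their object sets overlap with `verb`'s."""
    target = verb_objects[verb]
    scores = []
    for other, objects in verb_objects.items():
        if other == verb:
            continue
        # Shared objects divided by all objects either verb takes.
        overlap = len(target & objects) / len(target | objects)
        scores.append((other, overlap))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(similar_verbs("eat", VERB_OBJECTS))
print(similar_verbs("read", VERB_OBJECTS))
```

The same function would work unchanged for subjects, or for adjectives and the nouns they modify, given a table of that shape.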
So, I've described multiple programs here. Do any of them exist?