In response to:
Dumais, Susan. (2003) “Data-driven approaches to information access.” Cognitive Science. 27: 491-524.
I was having a hard time understanding what was going on with LSA, and had to write up a description of what I thought was happening while going through the explanation in section 2.1. What I got from the list of steps was as follows:
They create a matrix (a glorified spreadsheet) of the count of every word in every document, then choose a threshold (number of terms, or frequency of occurrence in a single doc) above which the count is retained and below which the term is discarded (or is it the inverse?). Then they create sets of documents based on the similarity of term frequency.
Then, immediately afterward, I realized I was wrong. What LSA is actually about is tracking relationships between words. So with the TOEFL example, it is really a huge accomplishment when viewed in context, but it takes away from the impressiveness that LSA isn’t able to outperform the ESL students who actually take the test. That is, LSA doesn’t perform as well at interpreting meaning as a native speaker. It is a step, and it outperforms word-matching, but we aren’t there yet. Given the example of physician as a synonym for doctor or nurse, if the LSA dimensions were also controlled for synonymy, it would outperform the ESL students. But the whole point of LSA is that it doesn’t require a thesaurus to control for synonymy. The net effect is that LSA performs at the same level as ESL students, but in different ways. Related to the Gmail April Fool’s joke on “Autopilot”, it makes me wonder: if LSA were a student, would it pass its classes? After all, the ESL students continue to learn after the test is administered.
I am probably still missing the point on how the magical math works, but it seems to me that physician would have significantly lower co-occurrence with doctor than it would with nurse, so even though doctor and physician show up in the same contexts, physician and nurse OR doctor and nurse would have higher co-occurrence. Nurse appears in context with each of them some amount more often than physician and doctor appear with each other, which seems like it could be a major problem throughout LSA.
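A toy sketch in Python can make that worry concrete. Everything here is an assumption for illustration: the six-document corpus is made up so that doctor and physician never share a document while nurse appears with each of them, the term-document matrix holds raw counts with no term weighting, and only k = 2 dimensions of the SVD are kept. It is not the setup from the paper, just the general truncated-SVD idea behind LSA.

```python
import numpy as np

# Made-up corpus: "doctor" and "physician" never appear in the same document,
# but both appear alongside "hospital" and "patient"; "nurse" appears with each.
docs = [
    "doctor nurse hospital patient",
    "physician nurse hospital patient",
    "doctor hospital patient",
    "physician hospital patient",
    "car engine road",
    "car road driver",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-document matrix of raw counts (rows = terms, columns = documents).
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        counts[index[w], j] += 1

# Truncated SVD: keep only the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2  # illustrative choice, far smaller than anything used on a real corpus
term_vecs = U[:, :k] * s[:k]  # each row is a term in the reduced space


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def raw_sim(w1, w2):
    """Similarity from raw document co-occurrence (rows of the count matrix)."""
    return cosine(counts[index[w1]], counts[index[w2]])


def lsa_sim(w1, w2):
    """Similarity in the reduced k-dimensional space."""
    return cosine(term_vecs[index[w1]], term_vecs[index[w2]])


for w1, w2 in [("doctor", "physician"), ("doctor", "nurse"), ("doctor", "car")]:
    print(f"{w1}/{w2}: raw={raw_sim(w1, w2):.2f}  lsa={lsa_sim(w1, w2):.2f}")
```

In this toy case the raw similarity between doctor and physician is zero, since they never share a document, while doctor and nurse score higher on raw counts, which is exactly the worry above. In the reduced space, though, doctor and physician come out nearly identical because they keep the same company (hospital, patient). Nurse ends up close to both as well, so the sketch does not show LSA separating synonyms from merely related words; it only suggests that direct co-occurrence is not what the reduced dimensions measure.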