
"Reading" without reading

I asked Lotaria if she has already read some books of mine that I lent her. She said no, because here she doesn't have a computer at her disposal.

She explained to me that a suitably programmed computer can read a novel in a few minutes and record the list of all the words contained in the text, in order of frequency. "That way I can have an already completed reading at hand," Lotaria says, "with an incalculable saving of time. What is the reading of a text, in fact, except the recording of certain thematic recurrences, certain insistences of forms and meanings..."

Italo Calvino, If on a winter's night a traveler

Derek and I have been laying the full-court press on all things related to CCC Online this week, in the hopes of rolling the site out publicly with the release of the first issue of this year's volume (57.1), and we're getting pretty close. Upon its release, the site will have archived the past four years of essays, and while that doesn't sound like a lot, believe me when I say that it has been. There's been a great deal of information to compile, and we've also had to design a workflow in the process, one that will enable us to continue working backwards in time.

Anyhow, one of the major features of the site is ready to roll. In addition to publishing the metadata on each article, we've been generating some additional material in the form of keywords. Beginning with this year, CCC authors will supply a set of keywords for their articles, but as you can imagine, trying to track down the authors of 50-odd years of articles and get keywords didn't strike us as a winning proposition.

And so, like Lotaria above, we looked for technological assistance. Inspired in part by the work of people like Cameron Marlow and Anjo Anjewierden, we needed a way to "read" these essays, reducing them to a set of 10-15 keywords, a way that wasn't prohibitive in terms of time or labor. After much searching, despairing, tweaking, and yes, whining, we ended up with a couple of Perl scripts that seem to be doing the trick. You could almost hear the relief, I imagine.
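For the curious, here is roughly what that kind of "reading" looks like. This is a minimal sketch in Python, not our Perl scripts: it counts plain words after dropping a stopword list, whereas our scripts work with nouns and noun phrases. The names `STOPWORDS` and `top_keywords`, the tiny stopword list, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be far longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "it",
             "that", "for", "on", "with", "as", "by"}

def top_keywords(text, n=15):
    """Return the n most frequent words in text, ignoring stopwords."""
    # Lowercase the text and pull out runs of letters (and apostrophes).
    words = re.findall(r"[a-z']+", text.lower())
    # Count everything that is not a stopword and is longer than two letters.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

# Made-up sample text, standing in for the full text of an essay.
sample = ("The students read the essay, and the students tagged "
          "the essay for the journal.")
print(top_keywords(sample, n=2))
```

Something like this will surface the most insistent words in a text, but it has no idea what a noun phrase is, which is a big part of where the tweaking and whining came in.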

The results of our text parsing are valuable in and of themselves, I think, and they show up for the individual entries on CCC Online, under the heading of Tags. While they can't fully account for a given article's complexity or nuance, we're operating according to a principle a lot like the power law--our attitude is that the majority of an essay's message is concentrated in the handful of words that appear just below the threshold of articles, pronouns, prepositions, etc. They represent that "thematic recurrence" or the "insistences of meaning."

For instance, in Diana George's essay, which I mentioned a few days ago, we isolated close to 1600 nouns and noun phrases, appearing a total of 3500 times. (These are really rough numbers, for reasons that I could explain if anyone's really interested.) Now, the top 1% of those nouns and noun phrases, or about 16 of them, account for around 500 appearances (approx. 15%). Expand the selection to 5% (80 nouns), and the appearances jump to 1200 (almost 33%). 10% of the words (160) gives us a little less than half at 1600 (about 45%). And 20% (320), a magic percentage for power laws, yields 2100 instances, or around 60%. This may not be interesting to anyone but me, but while it doesn't quite match up with the power law, it's close enough to be suggestive. And the roughness of my numbers is rough in the right direction for the claim I could make.
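If you want to check the arithmetic, the percentages are just appearances divided by the rough total of 3500. Here is a small Python sketch using only the rounded figures quoted above (the names `tiers` and `coverage` are mine, for illustration). Since the inputs are rough, the results land near the percentages in the paragraph rather than exactly on them.

```python
# Rough total number of noun/phrase appearances, as quoted in the post.
total_appearances = 3500

# (share of distinct nouns, approximate cumulative appearances), from the post.
tiers = [(0.01, 500), (0.05, 1200), (0.10, 1600), (0.20, 2100)]

def coverage(appearances, total):
    """Percentage of all appearances covered, rounded to one decimal place."""
    return round(100 * appearances / total, 1)

for share, appearances in tiers:
    pct = coverage(appearances, total_appearances)
    print(f"top {share:.0%} of nouns -> {pct}% of appearances")
```

A strict 80/20 split would have the top 20% of nouns covering 80% of the appearances; here they cover about 60%, which is the sense in which the numbers are suggestive rather than a match.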

Here's where it gets really cool, though. We've generated lists of keywords for all of the articles published in CCC over the past four years, and placed those keywords on the individual pages themselves. Because we're using MT to publish these entries, though, we've made them available for services like CiteULike and del.icio.us. And so, we've established a CCC Online account at del.icio.us (http://del.icio.us/ccco/), where we've first bookmarked all of the articles from the last four years, and then used our keywords as tags for the articles themselves. And the keywords on each entry at CCCO are links to our del.icio.us page for that tag.

For those unfamiliar with del.icio.us, I recommend scrolling down and finding Options at the bottom of the right-hand column, and starting with "View as Cloud," "Sort by Alpha," and "Show Bundles." The option that appears in black is the one that's active. The cloud uses color and size to indicate which tags are most frequent, and we've separated out the issues themselves into a separate bundle. We also added tags for CCCC Chair's Addresses and the Braddock Award winners. Spaces aren't permitted, and so you'll notice that we're doing the WikiWord thing for phrases.
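The WikiWord thing is simple enough to sketch. This Python snippet is an assumption about the convention (capitalize each word, drop the spaces), not a copy of what our scripts do, and `to_wiki_word` and the example phrases are made up for illustration.

```python
def to_wiki_word(phrase):
    """Join a multi-word phrase into one space-free tag, capitalizing each word."""
    # Uppercase only the first letter of each word and leave the rest untouched,
    # so acronyms inside a phrase keep their capitals.
    return "".join(word[:1].upper() + word[1:] for word in phrase.split())

print(to_wiki_word("visual rhetoric"))
print(to_wiki_word("CCCC chairs address"))
```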

Tags at the top and bottom of the frequency list are less than optimal, of course. "Students" appears as one of the top 10 in 69 of the 84 essays we've tagged thus far, for example, which isn't particularly useful except as an example of the kinds of concerns most likely to appear in CCC. And at the bottom is a mix of tags, some of which will probably rise as we expand the range and others which will end up being something like Amazon's statistically improbable phrases (SIPs). The range in the middle, though, we hope will help researchers in our field by seeding their bibliographic work (it is only a single journal, after all).
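One way to picture that middle range: count how many essays each tag shows up in, then set aside the tags that appear nearly everywhere and the ones that appear only once. The Python sketch below is hypothetical. We aren't actually filtering anything on the site, the cutoffs `min_docs` and `max_share` are arbitrary, and the four toy "essays" are invented.

```python
from collections import Counter

def document_frequency(tagged_essays):
    """Count how many essays each tag appears in."""
    counts = Counter()
    for tags in tagged_essays:
        # Use a set so a tag repeated within one essay is counted once.
        counts.update(set(tags))
    return counts

def middle_range(tagged_essays, min_docs=2, max_share=0.5):
    """Tags in at least min_docs essays but no more than max_share of them."""
    total = len(tagged_essays)
    df = document_frequency(tagged_essays)
    return sorted(tag for tag, count in df.items()
                  if count >= min_docs and count / total <= max_share)

# Invented toy data: each inner list stands for one essay's tags.
essays = [
    ["students", "writing", "rhetoric"],
    ["students", "writing", "literacy"],
    ["students", "rhetoric", "visual"],
    ["students", "assessment", "literacy"],
]
print(middle_range(essays))
```

In the toy data, "students" is in every essay and drops out at the top, while the one-off tags drop out at the bottom.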

More importantly, though, I think that del.icio.us provides us with the beginnings of a map of the journal--whether it's extrapolable to the field as a whole I'm reserving judgment about, but I'm excited about the possibility. It's an eminently searchable map, as well as one that permits the kind of exploration that isn't nearly as convenient otherwise. There's plenty more to say about it, I'm sure, but right now, I kind of want to just sit back and feel a little pride.

So, yeah, that's part of what we've been up to.


Listed below are links to weblogs that reference "Reading" without reading:

» automatic academic reading ... from mediatope II
... this sounds quite cool to me: having an humanities academic digital journal (CCC online) "read" by some PERL scripts which identify 15 "keywords" per essay, transforming these keywords into tags and loading them up to del.icio.us to create all...


Wow! Impressive!! Looking forward to reading what you've put together on the Cs site. I can't imagine the amount of work that required--kudos to you!


Re: the del.icio.us stuff


Thanks! We're pretty damn pleased with ourselves...

This is exciting news. I've been asked to become "web editor" for a journal; if you plan to share code, and I applied similar meta-work to the other journal, maybe we could talk a few others into doing it, and start moving toward a critical mass of searchable composition scholarship.

That's exactly my hope, that other journals will hop on board this train as well. In fact, that was part of the vision that I pitched to NCTE. I'm more than happy to offer code, insight about the process, whatever would be useful...just let me know.

If you want to be able to search your data at all, I strongly suggest you sync your data over to Simpy (it knows how to pull data from del.icio.us). Simpy is powered by a real full-text search engine, which you will need if you want to be able to cut across your links and tags in multiple dimensions (i.e. not just by 1 tag and not just by "tag intersection"). del.icio.us simply can't do that. Simpy can, and people tend to love that.