How Would A Robot Read a Novel?

Last week, I went to a rather interesting talk at the LSE titled ‘How Would a Robot Read a Novel?’. I was introduced to a piece of software, primarily used in the social sciences, called Alceste. (Note: this, and many of the other sites I’ve linked to in this post, are Google-translated pages of French originals. There seems to be surprisingly little about it on the web in English.) What Alceste does is look for repeated co-occurrences of words across a large volume of text in order to identify patterns. In the social sciences, it is used (still in only a few places, and in a limited number of cases at that) to detect instances of bias in surveys. Research has apparently shown that when words occur in the same pattern repeatedly, it is rarely random.
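To make the idea of ‘co-occurrence’ a bit more concrete, here is a toy sketch in Python. To be clear, this is not Alceste’s actual algorithm, which is proprietary and, from what I gather, considerably more sophisticated. It simply counts how often pairs of words turn up together in the same chunk of text. The function name, the tiny stop-word list and the sample sentences are all made up for illustration.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical stop-word list, kept tiny for the sake of the example.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "he", "she", "it"}


def cooccurrence_counts(segments):
    """Count how many segments each unordered pair of words appears in together."""
    counts = Counter()
    for segment in segments:
        # Lowercase, split into words, and drop stop words and duplicates.
        words = set(re.findall(r"[a-z']+", segment.lower())) - STOP_WORDS
        # Sorting makes ("bank", "club") and ("club", "bank") the same key.
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


# Invented sample text, standing in for chunks of a novel.
segments = [
    "The club sold the striker to pay the bank.",
    "The bank wanted the club to pay the debt.",
    "She loved the striker.",
]

counts = cooccurrence_counts(segments)

# Pairs that co-occur in more than one segment are the 'repetitions'.
repeated = {pair: n for pair, n in counts.items() if n > 1}
print(repeated)
```

In this sample, ‘bank’, ‘club’ and ‘pay’ keep turning up together, while ‘striker’ appears twice but with different company each time. A plain word count would treat all four words alike; counting pairs is what separates them.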

Alceste doesn’t understand meaning, and makes no pretence of doing so. It was created by Max Reinert of the National Centre for Scientific Research (CNRS) in France and, from what we were given to understand, is now marketed by a company called Image that holds all rights to it.

Anyway, now that I’ve given you the context, let me move on to what was really interesting about the talk. Dr. Kavita Abraham, a researcher at the LSE’s Methodology Institute, used Alceste to analyse a novel called The Kilburn Social Club by Robert Hudson. It is worth noting here that when Alceste was introduced as having been used earlier, as an experiment, to assess some literary works, members of the audience were easily able to identify the books as Oliver Twist and Moby Dick. With The Kilburn Social Club, Dr. Robert Hudson (a history academic-turned-author) admitted that Alceste’s analysis matched the pattern of the story he started out intending to write, in that the words used were generally grouped around four themes (16% descriptive, 12% football, 22% finance and 50% relationships). So it could be used, hypothetically, during the process of writing to ensure that a book wasn’t skewed heavily in one direction or another.

Dr. Hudson clearly meant ‘hypothetically’, though, because the truth is, as we discussed after the talk, we don’t really need Alceste to tell readers about patterns in books. Why would you want to reduce a work of art to a mere jumble of statistically correlated groups of words? People read literary works FOR that element of bias (I think James is writing a post about how opinion – bias, if you must – is in fact often not given the respect it deserves in today’s world). A quote of Mark Twain’s was proffered by one of the panel members: ‘A classic is something that everyone wants to have read but no one wants to read’. I’d argue that, at a stretch, you can extend it to summarizing business books – the way Kevin Duncan does on his blog, for example. That is useful to time-starved people who want to be able to speak intelligently about a book and learn the distilled lessons from it, but who don’t have the time to wade through it in its entirety. You just can’t do that with novels, though! Here’s an example of how Alceste summarized that potboiler of potboilers, The Da Vinci Code. It’s quite a laugh.

Picture 1Picture 2

One issue left simmering in my mind as I left the venue is that we’re introduced to so many technologies on a daily basis that many of us perhaps never really question whether we need them – something that is probably even more common among clients. Is ‘I want a social media’ really still an accepted statement?

Google Buzz is being debated as either a highly intrusive or potentially highly social application, while right here at Made by Many we’re arguing the benefits of using Yammer at work versus plain old Twitter. The question isn’t what we can do with it, as in the case of Alceste, where it has been accepted that it is really only useful to the social sciences because that discipline is based on the removal of bias. The question is: do we need it at all?

(A PDF of the talk, for those interested, is now available here).

About the author

Anjali Ramachandran is a strategist/planner who loves all things interesting, mostly digital.

  • Comments (8)

    1. Haha! That summary of the Da Vinci Code is substantially more interesting from a literary point of view than the couple of pages of it that I’ve read.

      One thing that is worth noting is that repetition is a very powerful literary technique. We are taught to avoid it in our writing, because it is unsettling and, well, repetitious when used carelessly, but look at Kafka’s or Gogol’s work for examples of where repetition is used to stunning and quite deliberate effect. Kafka, in particular, uses it precisely to unsettle the reader.

      For years, Kafka’s repetition was invisible to readers who had no German because Kafka’s translators (Willa and Edwin Muir) silently edited him by replacing the repetition with a series of what they considered to be synonyms. It’s only very recently that new translations that have respect for the original text have emerged.

      The big question, then, is how does a computer program tell the difference between genuine literary repetition and just pisspoor writing like Dan Brown’s?

      I think an analysis of Proust’s In Search of Lost Time would show a great deal of repetition as well. It would be fascinating to see the summary Alceste produced for that!

    2. James – thanks for the comment! Very interesting, what you said about Kafka and Gogol’s use of repetition as a literary technique. And you brought out a very valid point about words and meanings literally getting lost in translation.

      I wonder if there are any books in the last few years (post 1950ish) that use that technique.

      In response to your question – it was one of the things that was asked after the talk – nope, this computer program can’t tell the difference. It doesn’t understand meaning at all. It uses the binary system to allocate values to different words, apparently.

      Which is why we really shouldn’t get into analysing literary works, other than for a bit of fun!

    3. Off the top of my head, I would say that you would find a good deal of repetition in the work of Roberto Bolaño, Javier Marías, W.G. Sebald and Cormac McCarthy among others since 1950. I’m specifically thinking of repetition at the sentence level; a deliberate use of the same word multiple times in a reasonably short space of time for effect, not necessarily just words that are found regularly throughout the text.

      The frustrating thing about things being lost in translation is that so much of this is not because of an inherent difficulty in preserving all the possible meanings of the source language in the target language, but that there is a significant movement in translation that finds it acceptable, even laudable, to ‘clarify’ meaning, or to silently ‘improve’ a sentence. This is not something being lost, but rather being stolen, in translation.

    4. I just remembered the Times Labs’ Book Scraper thing. Here’s a look at Joyce’s vocabulary, for example: http://labs.timesonline.co.uk/bookscraper/authors/joyce-james

      It’s perhaps not surprising that most of the most important words are the names of characters, but it shows another problem that such an analyser would find. (Just about the only significant non-character word in that cloud is ‘Dublin’, and you could certainly make an argument that Dublin was one of Joyce’s most important characters. So maybe there are none!)

      Unfortunately, their analysis of Kafka – http://labs.timesonline.co.uk/bookscraper/authors/kafka-franz – only covers Metamorphosis and not The Trial or The Castle, where the repetition effects are most in evidence.

    5. super interesting. funny i worked a lot with Kavita back at the glory days at the LSE, she’s very good!

      Alceste is a VERY old and dated thematic analysis tool, which i’m surprised they are still using – even wordle (http://www.wordle.net/) can give you more interesting results, at least visually.

      rock on

    6. James – thanks for the pointer to the Times Labs thing. How old is it? I guess the thing about Alceste is it tracks co-occurrences rather than mere occurrences, so that’s how it differs from Times Labs. To your other point, yes it’s sad that translation completely alters the meaning of so many literary works. Personally I find translated works excellent in that they allow me to read authors and books I wouldn’t be able to otherwise. I’m not a critic to the extent that I can make out the lack of repetition (for example) that would have worked so well in the original, but I wish I was. :)

      Asi – hey! interesting to know you worked with Kavita! I don’t know if you took a look at the PDF of the presentation they gave, but they did in fact mention Wordle. I’m mildly surprised that the LSE is only using Alceste now. There was some noise about how it was very proprietary and the actual know-how couldn’t be released – maybe they’re only catching up now?

    7. Hi hi,

      And thank you very much for the write-up. You’re completely right, basically – no one sensible would hand over any judgement to Alceste.

      On the other hand, being fair to the Robot, I *think* that the summary you show was made by Microsoft Word, not Alceste, which just produced word clusters.

      What I found impressive about Alceste’s thematic clusters was that, instead of assuming everything was about politics, football or business – the outward subjects of the book – its largest collection of words was about relationships. Of course, it needed me to say that this is what the words meant, but it made me think that Alceste produces data that is accurate on a certain crude level, and which it might be possible for humans to use, in certain contexts. The program’s readings of other books were similarly reasonable, on this same crude level.

      I am trying to be careful here – Alceste produces data, not answers. It can crunch through texts (1000 minor novels of the 1930s, for instance) on a scale that would be impossibly time-consuming for a typical reader. I have no idea what things a researcher might find in the data, and I suspect that the data might be more useful to a historian than a literary critic, who would be more explicitly concerned with issues of judgement. Does this all sound impossibly woolly?

      Robbie

    8. Hi Robbie – and thanks for your comment. It’s nice to have a panelist commenting on my blog post!

      You’re right – the data that Alceste produces is likely to be much more useful to a data analyst/researcher than a literary critic. They may be useful for authors at some crude level, as you say, by assessing how biased or non-biased towards certain emerging themes a book may be, as well. But yes, data rather than literary analysis. No, it doesn’t sound woolly at all!

      Anjali
