I wasn’t really planning on starting my next post with a one-word title starting with V like the previous one. This happened purely by accident, so I will provide the history first.
Some of my close friends are aware that I am a big fan of Project Euler. If a problem is hard, I focus on it in my spare time until I solve it. In the past, once I was done with one, I still had several unfinished problems to go back to. But after reaching 100%, I mostly have to wait for the weekly problems. And right now, the season has ended with a break till 1st September. To entertain myself, I look on the web for people gossiping about Project Euler. In one of those moments, I stumbled upon the website of Mr Anand. I had visited this website earlier and came back to it today. I almost did not recognize who he was until I saw the About Me page and his nicknames. He was one year my junior in the same hostel when I was doing my BTech. I tried contacting him, first to congratulate him on working on Project Euler, and second to admonish him not to post the solutions on the web, as it’s against the spirit of Project Euler. I never received a reply, but probably some spam filter ate my email.
Anyway, now let’s come to the topic. While poking around the web once more to see what people are saying about Project Euler, I stumbled upon this website again. Looking at some of the stuff he was working on eventually took me to the Visualizations blog of Gramener. I first looked at the language of tweets and noted that the word “love” is used more in the evenings.
What caught my attention was the Mahabharatha visualization app. First, I am all for visualization. I even created a very good word cloud app for the public to use, and it gets a lot of hits. Forget about word clouds, I even created an image cloud using Amazon product data offered as a web service. However, visualization is not the end of data analysis, it’s just the beginning. It provides a direction to someone who is lost. It offers a way to prioritize what to focus on with the limited resources one has.
The bigger problem I see with any visualization technique is coming to a conclusion too soon. Let’s take the above Mahabharatha visualization app. If I understand correctly, the way it was developed is that they parsed all the freely available text and created a set of records of the form (chapter, section, character-name, word number). Here, word number indicates the position of the word from the beginning of the section.
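To make that record format concrete, here is a minimal sketch of how such records could be produced. This is only my guess at the approach, not Gramener's actual code: the `Mention` type, the `extract_mentions` function and the simple word-matching rule are all hypothetical.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    """One record: a character name found at a word position in a section."""
    chapter: int
    section: int
    character: str
    word_number: int  # 1-based position of the word within the section


def extract_mentions(chapter, section, text, characters):
    """Scan one section's text and emit a Mention for every known name.

    Assumption: a name is a single alphabetic word, matched case-insensitively.
    """
    names = {c.lower(): c for c in characters}
    mentions = []
    for position, word in enumerate(re.findall(r"[A-Za-z]+", text), start=1):
        name = names.get(word.lower())
        if name is not None:
            mentions.append(Mention(chapter, section, name, position))
    return mentions
```

Note that nothing in such a record says who is speaking. It only says that a name appears at a given position.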
Let me digress a bit regarding Big Data. While they claim 1.8 million words is sort of big data, I have to disagree. I disagree because I have a cron job that processes more than a million products from Overstock’s product feed and dumps the data into SQLite. I implemented my own rudimentary full-text search on top of it, and all this processing takes only a few minutes on my 2.8 GHz Core i7 machine. So, I know it’s not much volume.
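For the curious, a rudimentary full-text search on SQLite can be as simple as an inverted index of words to product ids. The sketch below is an illustration under my own assumptions, not my actual cron job: the table names, the `build_index` and `search` functions and the word-splitting rule are made up for the example.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(conn, products):
    """Store (id, title) pairs and one (term, product_id) row per distinct word."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS terms (term TEXT, product_id INTEGER)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_terms ON terms (term)")
    for product_id, title in products:
        conn.execute("INSERT INTO products VALUES (?, ?)", (product_id, title))
        conn.executemany(
            "INSERT INTO terms VALUES (?, ?)",
            [(word, product_id) for word in tokenize(title)],
        )
    conn.commit()


def search(conn, query):
    """Return titles that contain every word of the query, ordered by id."""
    words = sorted(tokenize(query))
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    rows = conn.execute(
        "SELECT p.title FROM terms t JOIN products p ON p.id = t.product_id "
        f"WHERE t.term IN ({placeholders}) "
        "GROUP BY p.id HAVING COUNT(DISTINCT t.term) = ? ORDER BY p.id",
        (*words, len(words)),
    ).fetchall()
    return [title for (title,) in rows]
```

With an in-memory database (`sqlite3.connect(":memory:")`) this is enough to try it out on a handful of titles.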
Anyway, once they have that data, they present this info in a very nice visual form that can be accessed by anyone via a browser. The statement “This makes it easy to see where characters speak together (e.g. where does Kunti throw away Karna? Where does she meet him again?)” caught my attention. So I immediately checked that, and something stood out. Mahabharatha has 18 chapters. See the below image for Kunti and Karna together (I am only showing from chapter 7 onwards, which is sufficient for what I want to say).
The blue dots are for Kunti and the other color is Karna. Notice that the last time you see both these dots appear together is in Chapter 18, Section 4. Going by their statement “This makes it easy to see where characters speak together”, one might quickly draw the conclusion that these two were talking to each other till the very end of this epic. There are a few problems with this. One, even if two names occur in the same section, there is no way to know that the conversation is actually between them. Next, just because there is a name doesn’t mean that person is actually talking! Someone else could just be referring to that person by name.
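To see why “appear together” is a much weaker statement than “speak together”, here is a rough sketch of what a co-occurrence check over such records amounts to. The function name and the toy records are mine; the chapter and section numbers come from the discussion above, while the word numbers are made-up placeholders.

```python
from collections import defaultdict


def cooccurring_sections(records, name_a, name_b):
    """Return sorted (chapter, section) pairs in which both names appear.

    Each record is a (chapter, section, character, word_number) tuple.
    This only tests whether both names are present in a section. It knows
    nothing about who is speaking or to whom.
    """
    names_by_section = defaultdict(set)
    for chapter, section, character, _word_number in records:
        names_by_section[(chapter, section)].add(character)
    return sorted(
        key
        for key, names in names_by_section.items()
        if name_a in names and name_b in names
    )


# Toy records; the word numbers are placeholders, not real positions.
records = [
    (8, 91, "Karna", 10),
    (18, 4, "Kunti", 5),
    (18, 4, "Karna", 12),
]
print(cooccurring_sections(records, "Kunti", "Karna"))  # [(18, 4)]
```

A narrator mentioning both names in one section is enough to produce a hit, which is exactly the trap.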
Their blog also says “But the growing field of text analytics and text visualisation tell us that there’s a lot more structure to plain text than one might think.” Yes, but it’s very, very difficult to figure out that structure by just tokenizing the text. Yes, I know information retrieval has sound math (tf-idf) behind it. But when it comes to text analysis, the main obstacle is not visualization but the sophistication of algorithms that can actually understand the context.
The reason this immediately caught my attention is that I know a bit of that great epic. So, I know that Karna actually dies in the Kurukshethra war, and so there is no way for Kunti and Karna to be talking till the end! In fact, Karna dies in Chapter 8, Section 91, several chapters before the end. So, because I knew Mahabharatha already, I could easily spot this anomaly. But how can someone who is not familiar with it make a wise decision, even when all the data is presented with good visualization?
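This kind of domain knowledge can be written down as a simple sanity check. The sketch below is only an illustration: the function name is hypothetical, and the two locations are the ones mentioned in this post (the last co-occurrence at Chapter 18, Section 4 and Karna's death at Chapter 8, Section 91).

```python
def suspicious_cooccurrences(sections, last_possible):
    """Return the (chapter, section) pairs that come after last_possible.

    Python compares tuples element by element, so (18, 4) > (8, 91)
    because chapter 18 comes after chapter 8.
    """
    return [s for s in sorted(sections) if s > last_possible]


KARNA_DEATH = (8, 91)          # chapter, section, as stated in this post
seen_together = [(18, 4)]      # last co-occurrence seen in the visualization

print(suspicious_cooccurrences(seen_together, KARNA_DEATH))  # [(18, 4)]
```

Any section flagged this way cannot be a conversation between the two, whatever the dots suggest. Of course, someone still has to supply the known fact in the first place.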
So, the points I want to make are:
- Visualization is important
- But cleansed data is even more important
- Cleansing data is relatively easy for numbers occurring in transactions
- But not so when it comes to text analysis
Also, in my personal experience of using reports, the people who use the reports are the ones best placed to suggest what changes would help them in their work. At work, we have something called a bugdb that tracks bugs. There are frontend ad-hoc query tools for this system. As a manager, if I want to know more about a specific metric, I could use one of the stock reports. But it’s my ability to formulate complex queries and get the result in the exact way I want that lets me respond much faster, with confidence and with an understanding of what went into coming up with that final magic number.
So I think as the cloud picks up and more and more transactional data moves to the cloud, tools that let a non-technical person (but one sound in statistics, or a number-savvy MBA type) easily create reports the way they want would be in great demand. Of course, offering reasonable response time is the key to wider adoption of such tools.