This week, I explored how to analyse text with a variety of tools.
When I looked at newspaper articles for ‘The Great War’ vs ‘First World War’ and ‘World War I’, I found that there are a few results for ‘World War I’ in 1916. But logically, people wouldn’t have called it ‘World War I’ before WW2 occurred. As the course handbook explains, this is because users added the tag ‘World War I’ when they were viewing other articles.
But I think this shouldn’t happen. When QueryPic builds its results, user tags should be ignored, or some other method should be used to exclude them, to improve accuracy.
When I used the Google Ngram Viewer to compare ‘China,Japan,America’ and ‘China,Japan,USA’, both in English, I found that ‘USA’ shows up very late, not until the 1940s, while ‘America’ is always very high. Was the ‘USA’ only founded in the 1940s? Definitely not. I’ll explore this further.
I found ‘WordCounter’ at www.databasic.io quite useful for showing the ‘most’ and ‘least’ frequent words. It is especially useful for journalists, who love to write about what a politician says most often. It’s also quite useful when I analyse journal articles and speeches. I think I will use it for my project: because it can count what people said most, I can analyse the situation at that moment (the Tian’anmen massacre).
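To get a feel for what a word counter like this does, here is a minimal Python sketch. It is not how WordCounter itself is implemented; the function name `word_counts` and the sample sentence are made up for illustration, and the way it splits text into words is a simplifying assumption.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word appears, ignoring upper/lower case."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample text, standing in for a speech or an article.
sample = "We will rebuild. We will recover. We will not give up."
counts = word_counts(sample)

# The two most frequent words, with their counts.
print(counts.most_common(2))

# Words that appear only once (the 'least' frequent end of the list).
print([word for word, n in counts.items() if n == 1])
```

On a real speech, the top of this list is usually filled with very common words such as ‘the’ and ‘and’, which is why being able to exclude words matters.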
SameDiff is also very interesting; it’s a tool that helps compare texts. Yes, it’s also useful for journalists.
http://voyant-tools.org/ can create a word ‘cloud’; this kind of tool has actually also appeared on some social media. On Sina Weibo, users can use a plug-in to create their own unique image, usually made of the words they use most often in their posts.
http://voyant-tools.org/ and SameDiff can also be customised to exclude certain words that you don’t want counted (or to count only the words you want). This gives the user a clearer view.
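The idea of excluding words (or counting only certain words) can be sketched in a few lines of Python. This is only an illustration of the general idea, not the way Voyant Tools or SameDiff work internally; the names `STOP_WORDS`, `count_words`, `exclude` and `only` are hypothetical, and the stop-word list is a tiny made-up one.

```python
import re
from collections import Counter

# A tiny, made-up list of words to ignore; real tools ship much longer lists.
STOP_WORDS = {"the", "a", "and", "of", "to", "in"}


def count_words(text, exclude=None, only=None):
    """Count words, optionally dropping some words or keeping only some."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if exclude:
        # Drop the words we don't want counted.
        words = [w for w in words if w not in exclude]
    if only:
        # Keep only the words we are interested in.
        words = [w for w in words if w in only]
    return Counter(words)


text = "The students and the workers marched to the square."

# Everything except the stop words.
print(count_words(text, exclude=STOP_WORDS))

# Only the words we care about.
print(count_words(text, only={"students", "workers"}))
```

With the stop words removed, the words that carry the meaning of the text move to the top of the count, which is what makes the resulting cloud or list easier to read.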