SMU Office of Research – Love them or loathe them, cities are here to stay. In 2007, the world reached a major milestone, with the urban population outnumbering the rural population for the first time in human history. Since then, cities have grown at a rate of 60 million people per year and are set to house 66 percent of the world's population by 2050.
On the one hand, cities are engines of economic success, with the top 750 cities disproportionately accounting for 57 percent of global GDP. On the other hand, unplanned urbanisation can put a strain on infrastructure and social services, encouraging the growth of slums and exacerbating inequality. But what if the solution to these problems lay within the city itself?
Low-power sensors embedded in the environment can now automatically collect data ranging from pollutant and traffic levels to whether garbage bins are full. City dwellers themselves are rich sources of information, generating hundreds of thousands of geo-tagged photos, tweets and other social media updates, all in real time.
“The challenge now is to develop reliable systems that can sense, transport and make sense of these diverse streams of data and thus make urban services smarter and personalised,” said Singapore Management University (SMU) Associate Professor Archan Misra, a Director of the LiveLabs Urban Lifestyle Innovation Platform and co-General Chair of the International Conference on Distributed Computing and Networking (ICDCN), held in Singapore from 4 to 7 January 2016.
What happened? Let’s ask social media
Delivering a keynote speech titled ‘Social Sensing in Urban Spaces’ was Professor Tarek Abdelzaher, Willett Faculty Scholar at the Department of Computer Science, University of Illinois at Urbana Champaign. An embedded systems expert by training, Professor Abdelzaher shared his recent findings on what at first seems like a radically different field: social media analytics.
“It’s an exciting arena to work in because we are essentially inundated by the tremendous amount of data from social networks; data that we can exploit for a wide range of applications,” he said. “We are limited, however, by the way we process information, which has not changed even though the amount of information has grown exponentially.”
Drawing on his theoretical background in networking and signal processing, Professor Abdelzaher likened the goal of extracting information from social networks to designing a network protocol stack, the hierarchical processes that enable applications to run on physical networks of wires. But rather than transporting or retrieving information, the ‘protocol stack’ in a social network scenario is designed to help make data-based decisions.
“One application we built is an anomaly explanation service that can compare data from sensors and automatically mine the related tweets to provide a summarised explanation of what happened,” Professor Abdelzaher elaborated.
“For example, when a spike in radioactive activity was detected near the Fukushima nuclear plant, our algorithm identified a tweet from TEPCO, the company that was decommissioning the reactor, as the most important. When we translated the tweet from Japanese to English, we found that it was about a leak in the basement, which explained the spike detected by radiation sensors.”
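The published service is far more sophisticated, but the core idea — surfacing words whose usage spikes in tweets posted during a sensor anomaly — can be sketched in a few lines. Everything below (the function name, the scoring rule, the toy data) is illustrative, not Professor Abdelzaher's actual system:

```python
from collections import Counter

def explain_anomaly(tweets, anomaly_start, anomaly_end, background, top_n=5):
    """Rank words from tweets posted during a sensor anomaly window by how
    much their frequency jumps relative to a background (normal-period) corpus.

    tweets: list of (timestamp, text) pairs
    background: Counter of word counts observed during normal operation
    """
    window = Counter()
    for ts, text in tweets:
        if anomaly_start <= ts <= anomaly_end:
            window.update(text.lower().split())
    # Score favours words common during the anomaly but rare otherwise;
    # the +1 keeps never-before-seen words from dividing by zero.
    scores = {w: c / (1 + background[w]) for w, c in window.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy data: "leak" and "basement" appear only during the anomaly window,
# so they outrank everyday words like "the" and "plant".
background = Counter({"the": 100, "at": 80, "plant": 20})
tweets = [
    (5, "leak in the basement at the plant"),
    (6, "radiation spike at plant"),
    (20, "normal day at the plant"),
]
top_words = explain_anomaly(tweets, anomaly_start=0, anomaly_end=10,
                            background=background)
```

A real pipeline would of course select whole tweets rather than single words and summarise them, but the rarity-weighted scoring captures why a TEPCO tweet about a basement leak stands out against everyday chatter.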
Simpler than it seems
With 500 million tweets being generated in a single day, the ability of Professor Abdelzaher's algorithm to zero in on precisely the right tweet might seem like magic. Many researchers studying Twitter data rely on complex techniques such as natural language processing (NLP) to account for the meaning, or semantics, of the text being studied. But Professor Abdelzaher said that his approach is much simpler and cheaper to implement.
“The two sentences ‘John killed Mary’ and ‘Mary killed John’ have totally different meanings, but both have the same three words at the same frequency: each word occurs once. So human speech is often considered a rich medium in which not only the frequency of a word but also its relative position is important,” he explained.
The first step of a simple frequency-based, NLP-independent approach, Professor Abdelzaher said, is to consider pairs of keywords. The English language has about 10,000 commonly used words, making it challenging to separate hundreds of different events using single keywords alone. By considering keyword pairs instead, however, the space of possibilities grows to 100 million.
“The space of possible keyword pairs then becomes so much larger than the number of events that you want to isolate that we can almost guarantee that if you have two different events they will be associated with different salient keyword pairs,” he said.
Once tweets with a specified keyword pair have been linked to an event, the next step is to determine whether the statement in the tweet is true or false, assigning it a binary 1 or 0 value. The trouble is that people are a noisy communication medium, sometimes providing false information or denying things that have happened, and thereby distorting the reliability of the social media ‘signal’.
To figure out which claims to believe, the researchers used maximum likelihood estimation, which works by starting from initial guesses for source reliability and claim correctness, and then iteratively refining these values to find the combination that maximises the likelihood of those particular sources making those particular claims.
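The iterate-to-maximum-likelihood loop described above is the classic expectation-maximisation (EM) pattern. The sketch below is a simplified generic EM for a binary source-claim matrix, written for this article rather than taken from the researchers' published system; the parameter names (`a`, `b` for per-source assertion rates, `d` for the prior) and the toy data are assumptions of the example:

```python
import numpy as np

def em_truth_discovery(S, n_iters=50, prior=0.5):
    """Jointly estimate claim truth and source reliability via EM.

    S[i, j] = 1 if source i asserted claim j, else 0.
    Returns (p, a, b): p[j] = estimated probability claim j is true;
    a[i] / b[i] = probability source i asserts a claim that is true / false
    (a reliable source has high a and low b).
    """
    n_src, n_claims = S.shape
    a = np.full(n_src, 0.6)   # initial guess: sources slightly reliable
    b = np.full(n_src, 0.4)
    d = prior                 # prior probability that a claim is true
    eps = 1e-6                # keeps probabilities away from exact 0 or 1
    for _ in range(n_iters):
        # E-step: posterior probability that each claim is true, given
        # the current source-behaviour estimates.
        log_A = np.log(d) + (S * np.log(a[:, None])
                             + (1 - S) * np.log(1 - a[:, None])).sum(axis=0)
        log_B = np.log(1 - d) + (S * np.log(b[:, None])
                                 + (1 - S) * np.log(1 - b[:, None])).sum(axis=0)
        p = 1.0 / (1.0 + np.exp(log_B - log_A))
        # M-step: re-estimate each source's behaviour and the prior.
        a = np.clip((S * p).sum(axis=1) / (p.sum() + eps), eps, 1 - eps)
        b = np.clip((S * (1 - p)).sum(axis=1) / ((1 - p).sum() + eps),
                    eps, 1 - eps)
        d = np.clip(p.mean(), eps, 1 - eps)
    return p, a, b

# Toy data: two sources corroborate claims 0 and 1; a lone third source
# pushes claims 2 and 3 with no support from anyone else.
S = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
p, a, b = em_truth_discovery(S)
```

Even from an uninformative starting point, the corroborated claims converge towards "true" and the lone source's claims towards "false", with the estimated reliability of the first two sources ending up above that of the third. This is the sense in which the algorithm separates signal from the noisy human communication medium.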
Professor Abdelzaher compared the results with human ratings of the tweets and found that the algorithm got it right 80 percent of the time. “In fact, we ended up with a tool that can beat the news in certain cases, reporting events based on social media data even before news outlets like the BBC or CNN,” he added.
Re-framing social media data
Having shown that it is indeed possible to develop many useful event detection applications based on social media data, Professor Abdelzaher believes that communication network researchers like himself have much to contribute to the emerging field of social media analytics.
“What I’ve presented is just scratching the surface; there are many other interesting techniques from estimation theory and information theory that can be brought to bear on social media data. With those tools, deeper theoretical analysis can be done and additional techniques from signal processing can be brought in to understand social media,” he concluded.
Now in its 17th edition, ICDCN is an international conference that presents the latest advances in the fields of distributed computing and communication networks. Focused this year on combining both ‘hard’ sensors and social media data to solve urban challenges, the conference, hosted at SMU, featured workshops, presentations and keynote speeches from leading academics.