ICCS 2016 Main Track (MT) Session 9
Time and Date: 14:30 - 16:10 on 6th June 2016
Room: Toucan
Chair: Ana Cortes
436 | Identifying Venues for Female Commercial Sex Work Using Spatial Analysis of Geocoded Advertisements Abstract: Despite being widely visible on the web, Internet-promoted commercial sex work has so far attracted limited attention from researchers. Current studies outline the issues associated with new forms of sex work; however, very little is known to date about their spatial manifestation. In this research we follow the environmental perspective in the spatial analysis of crime and deviance, assuming that the location of venues for the provision of commercial sex work can be modeled via algorithms trained on the distribution of possible correlates in proximity to existing venues. Visualizations of the results are presented along with the error and score metrics used to evaluate the applicability of specific machine learning methods. The paper concludes with an assessment of potential extensions and of the peculiarities of the data used in the research. |
Daniil Voloshin, Ivan Derevitskiy, Ksenia Mukhina, Vladislav Karbovskii |
21 | RTPMF: Leveraging User and Message Embeddings for Retweeting Behavior Prediction Abstract: Understanding the retweeting mechanism and predicting retweeting behavior are important and valuable tasks in user behavior analysis. In this paper, aiming to provide a general method for improving retweeting behavior prediction performance, we propose a probabilistic matrix factorization model (RTPMF) that incorporates user social network information and message semantic relationships. The contributions of this paper are three-fold: (1) We cast the problem of predicting user retweeting behavior as a probabilistic matrix factorization problem; (2) Following the intuition that users' social network relationships affect their retweeting behavior, we extensively study how to model social information to improve the prediction performance; and (3) We also incorporate message semantic embeddings to constrain the objective function by making full use of the messages' additional content-based and structure-based features. The empirical results and analysis demonstrate that our method significantly outperforms the state-of-the-art approaches. |
Jiguang Liang, Bo Jiang, Rongchao Yin, Chonghua Wang, Jianlong Tan, Shuo Bai |
75 | Leveraging Latent Sentiment Constraint in Probabilistic Matrix Factorization for Cross-domain Sentiment Classification Abstract: Sentiment analysis is concerned with classifying a subjective text as positive or negative according to the opinion expressed in it. The performance of traditional sentiment classification algorithms relies heavily on manually labeled training data. However, not every domain has labeled data, because the labeling work is time-consuming and expensive. In this paper, we propose a latent sentiment factorization (LSF) algorithm based on the probabilistic matrix factorization technique for cross-domain sentiment classification. LSF works in the setting where labeled data are available only in the source domain and the target domain has only unlabeled data. It bridges the gap between domains by exploiting the sentiment correlations between domain-shared and domain-specific words in a two-dimensional sentiment space. Experimental results demonstrate the superiority of our method over the state-of-the-art approaches. |
Jiguang Liang, Kai Zhang, Xiaofei Zhou, Yue Hu, Jianlong Tan, Shuo Bai |
91 | Identifying Users across Different Sites using Usernames Abstract: Identifying users across different sites means finding the accounts that belong to the same individual. The problem is fundamental and important, and its results can benefit many applications such as social recommendation. Observing that 1) usernames are essential elements for all sites; 2) most users have a limited number of usernames on the Internet; and 3) usernames carry information that reflects an individual’s characteristics and habits, this paper tries to identify users based on username similarity. Specifically, we introduce the self-information vector model to integrate our proposed content and pattern features extracted from usernames into vectors. We define the similarity of two usernames as the cosine similarity between their self-information vectors. We further propose an abbreviation detection method to discover the initialism phenomenon in usernames, which can improve our user identification results. Experimental results on real-world username sets show that we can achieve, on average, a precision of 86.19%, a recall of 68.53% and an F1-measure of 76.21%, which is better than the state-of-the-art work. |
Yubin Wang, Tingwen Liu, Qingfeng Tan, Jinqiao Shi, Li Guo |
441 | A Hybrid Human-Computer Approach to the Extraction of Scientific Facts from the Literature Abstract: A wealth of valuable data is locked within the millions of research articles published each year. Reading and extracting pertinent information from those articles has become an unmanageable task for scientists. This problem hinders scientific progress by making it hard to build on results buried in the literature. Moreover, these data are loosely structured, encoded in manuscripts of various formats, embedded in different content types, and are, in general, not machine accessible. We present a hybrid human-computer solution for semi-automatically extracting scientific facts from the literature. This solution combines an automated discovery, download, and extraction phase with a semi-expert crowd assembled from students to extract specific scientific facts. To evaluate our approach, we apply it to a particularly challenging molecular engineering scenario: the extraction of a polymer property, the Flory-Huggins interaction parameter. We demonstrate useful contributions to a comprehensive database of polymer properties. |
Roselyne Tchoua, Kyle Chard, Debra Audus, Jian Qin, Juan de Pablo, Ian Foster |