Selected publications in Data Mining, Machine Learning, and Information Retrieval venues.
We address the problem of customer retention (churn) in applications installed on over-the-top (OTT) streaming devices. In the first part of our work, we analyze various behavioral characteristics of users that drive application usage. By examining a variety of statistical measures, we answer the following questions: (1) how do users allocate time across various applications?, (2) how consistently do users engage with their devices?, and (3) how likely are dormant users to become active again? In the second part, we leverage these insights to design interpretable churn prediction models that learn the latent characteristics of users by prioritizing the specifications of the users. Specifically, we propose the following models: (1) Attention LSTM (ALSTM), where churn prediction is done using a single level of attention by weighting individual time frames (temporal-level attention), and (2) Neural Churn Prediction Model (NCPM), a more comprehensive model that uses two levels of attention, one to measure the temporality of each feature and another to measure the influence across features (feature-level attention). Using a series of experiments, we show that our models provide good churn prediction accuracy with interpretable reasoning. We believe that the data analysis, feature engineering, and modeling techniques presented in this work can help organizations better understand the reasons behind user churn on OTT devices.
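As a minimal sketch of the temporal-level attention idea described above (not the paper's exact ALSTM parameterization), a dot-product scorer plus softmax can weight per-time-frame hidden states into a single context vector; the scorer `score_vec` and the dimensions are hypothetical:

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(hidden_states, score_vec):
    """Attend over per-time-frame hidden states and return a weighted context.

    hidden_states: T vectors, one per time frame (e.g. from an LSTM)
    score_vec: a scoring vector; score_t = score_vec . h_t (simplified scorer)
    """
    scores = [sum(v * h for v, h in zip(score_vec, hs)) for hs in hidden_states]
    alphas = softmax(scores)  # attention weights, one per time frame
    dim = len(hidden_states[0])
    context = [sum(a * hs[i] for a, hs in zip(alphas, hidden_states))
               for i in range(dim)]
    return alphas, context
```

The attention weights `alphas` are what make the model interpretable: they indicate which time frames contributed most to the churn prediction.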
In the era of big data, online doctor review platforms, which enable patients to give feedback to their doctors, have become one of the most important components of healthcare systems. On one hand, they help patients choose their doctors based on the experience of others. On the other hand, they help doctors improve the quality of their service. Moreover, they provide an important source for discovering common concerns of patients and existing problems in clinics, which can potentially improve current healthcare systems. In this paper, we systematically investigate the dataset from one such review platform, ratemds.com, where each review for a doctor comes with an overall rating and ratings of four different aspects. A comprehensive statistical analysis is conducted first for reviews, ratings, and doctors. Then, we explore the content of reviews by extracting latent topics related to different aspects with unsupervised topic modeling techniques. As the core component of this paper, we propose a multi-task learning framework for document-level multi-aspect sentiment classification. This task helps us not only recover missing aspect-level ratings and detect inconsistent rating scores but also identify aspect keywords for a given review based on ratings. The proposed model takes both features of doctors and aspect keywords into consideration. Extensive experiments have been conducted on two subsets of the ratemds dataset to demonstrate the effectiveness of the proposed model.
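The multi-task setup above can be sketched as a shared review encoding scored by one head per aspect. This is only an illustration of the multi-task pattern, not the paper's model; the aspect names and the linear-head form are assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# hypothetical aspect names for illustration
ASPECTS = ["helpfulness", "staff", "punctuality", "knowledge"]

def predict_aspect_ratings(shared_repr, heads):
    """Score each aspect with its own head over a shared review encoding.

    shared_repr: feature vector produced by a shared encoder (assumed given)
    heads: dict mapping aspect -> (weight vector, bias), one head per task
    """
    out = {}
    for aspect, (w, b) in heads.items():
        z = sum(wi * xi for wi, xi in zip(w, shared_repr)) + b
        out[aspect] = sigmoid(z)  # score in (0, 1), one per aspect
    return out
```

Sharing the encoder across aspects is what lets ratings for one aspect inform predictions for the others.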
Several complex tasks that arise in organizations can be simplified by mapping them into a matrix completion problem. In this paper, we address a key challenge faced by our company: predicting the efficiency of artists in rendering visual effects (VFX) in film shots. We tackle this challenge by using a two-fold approach: first, we transform this task into a constrained matrix completion problem with entries bounded in the unit interval [0, 1]; second, we propose two novel matrix factorization models that leverage our knowledge of the VFX environment. Our first approach, expertise matrix factorization (EMF), is an interpretable method that structures the latent factors as weighted user-item interplay. The second one, survival matrix factorization (SMF), is instead a probabilistic model for the underlying process defining employees’ efficiencies. We show the effectiveness of our proposed models by extensive numerical tests on our VFX dataset and two additional datasets with values that are also bounded in the [0, 1] interval.
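One simple way to honor the [0, 1] bound in a matrix factorization is to squash the latent-factor dot product through a sigmoid link. This is a generic bounded-prediction sketch, not the paper's EMF or SMF formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_bounded(U, V, i, j):
    """Predict entry (i, j) of a matrix with values constrained to (0, 1).

    U, V: latent factor matrices, stored as lists of factor vectors.
    The sigmoid over the dot product keeps every prediction inside the
    unit interval, so no post-hoc clipping is needed.
    """
    dot = sum(u * v for u, v in zip(U[i], V[j]))
    return sigmoid(dot)
```

Training would then minimize a loss between observed efficiencies and these bounded predictions.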
The code can be downloaded from here
We experimented with four Amazon review datasets: Musical_Instrument, Electronics, Movies_and_TV, and Books. The trained doc2vec embeddings and other related files for the Musical_Instrument dataset can be downloaded here
Recommendation in the modern world is not only about capturing the interaction between users and items, but also about understanding the relationship between items. Besides improving the quality of recommendation, it enables the generation of candidate items that can serve as substitutes and supplements of another item. For example, when recommending an Xbox, a PS4 could be a logical substitute, and the supplements could be items such as game controllers, surround systems, and travel cases. Therefore, given a network of items, our objective is to learn their content features such that they explain the relationship between items in terms of substitutes and supplements. To achieve this, we propose a generative deep learning model that links two variational autoencoders using a connector neural network to create the Linked Variational Autoencoder (LVA). LVA learns the latent features of items by conditioning on the observed relationship between items. Using a rigorous series of experiments, we show that LVA significantly outperforms other representative and state-of-the-art baseline methods in terms of prediction accuracy. We then extend LVA by incorporating collaborative filtering (CF) to create CLVA, which captures the implicit relationship between users and items. By comparing CLVA with LVA, we show that inducing CF-based features greatly improves the recommendation quality of substitutable and supplementary items at the user level.
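The connector idea can be illustrated as a small network that maps the latent code of a source item into the latent space of a linked item. In LVA the connector ties the two variational autoencoders together; the single tanh layer below is a deliberately reduced stand-in, with all weights hypothetical:

```python
import math

def linear(W, b, x):
    # one dense layer: W is a list of weight rows, b the bias vector
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def connector(z_source, W, b):
    """Map a source item's latent code into a linked item's latent space.

    In LVA a connector network conditions the second autoencoder's code
    on the first; here it is reduced to one tanh layer for illustration.
    """
    return [math.tanh(h) for h in linear(W, b, z_source)]
```

Given an Xbox's latent code, the connector would produce a code from which the decoder reconstructs features of a likely substitute or supplement.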
Modeling spillover effects from observational data is an important problem in economics, business, and other fields of research. It helps us infer the causality between two seemingly unrelated sets of events. For example, if consumer spending in the United States declines, it has spillover effects on economies that depend on the U.S. as their largest export market. In this paper, we aim to infer the causation that results in spillover effects between pairs of entities (or units); we call this effect paired spillover. To achieve this, we leverage recent developments in variational inference and deep learning techniques to propose a generative model called the Linked Causal Variational Autoencoder (LCVA). Similar to variational autoencoders (VAE), LCVA incorporates an encoder neural network to learn the latent attributes and a decoder network to reconstruct the inputs. However, unlike VAE, LCVA treats the latent attributes as confounders that are assumed to affect both the treatment and the outcome of units. Specifically, given a pair of units u1 and u2 and their individual treatments and outcomes, the encoder network of LCVA samples the confounders by conditioning on the observed covariates of u1, the treatments of both u1 and u2, and the outcome of u1. Once inferred, the latent attributes (or confounders) of u1 capture the spillover effect of u2 on u1. Using a network of users from the job training dataset (LaLonde, 1986) and a co-purchase dataset from the Amazon e-commerce domain, we show that LCVA is significantly more robust than existing methods in capturing spillover effects.
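The conditioning set for u1's confounder posterior, as stated in the abstract, can be made concrete as the vector handed to the encoder. This helper only illustrates how the inputs are assembled; the encoder network itself and any dimensions are outside this sketch:

```python
def lcva_encoder_input(x_u1, t_u1, t_u2, y_u1):
    """Assemble the conditioning vector for u1's confounder posterior.

    Per the abstract, the encoder for unit u1 conditions on u1's observed
    covariates, the treatments of both u1 and u2, and u1's outcome.
    x_u1: covariates of u1; t_u1, t_u2: treatments; y_u1: outcome of u1.
    """
    return list(x_u1) + [float(t_u1), float(t_u2), float(y_u1)]
```

The inclusion of u2's treatment is what lets the inferred confounders of u1 absorb the spillover effect of u2 on u1.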
The growth of social media has created an open web where people freely share their opinions and even discuss sensitive subjects in online forums. Forums such as Reddit help support seekers by serving as a portal for open discussions of various stigmatized subjects such as rape. This paper investigates the potential roles of online forums and whether such forums provide the intended resources to the people who seek support. Specifically, the open nature of forums allows us to study how online users respond to seekers' queries or needs; through their responses, we attempt to assess the range of topics covered by responders in regard to the issues, concerns, and obstacles faced by the victims of rape and sexual abuse, using rape-related posts from Reddit. We employ natural language processing techniques to extract the topics of responses and examine how diverse these topics are, to answer research questions such as: whether responses are limited to emotional support; if not, what the other topics are; what the diversity of topics manifests; and how online responses differ from the traditional responses found in the physical world.
The rapid evolution of social network platforms such as Twitter, Facebook, and Instagram has resulted in heterogeneous networks with complex social interactions. Despite providing a rich source of information, the dimensionality of features in these networks poses several challenges to machine learning tasks such as personalization, prediction, and recommendation. Therefore, it is important to ask the question "how do we capture such complex interactions between users in simplified dimensions?". To answer this question, this article explores network representation learning (NRL), where the objective is to model the complex high-dimensional interactions between nodes in a reduced feature space, while simultaneously capturing neighborhood similarity and community membership.
Online reviews have become an inevitable part of a consumer's decision-making process, where the likelihood of purchase depends not only on the product's overall rating, but also on the description of its aspects. Therefore, e-commerce websites such as Amazon, Walmart, and Staples constantly encourage users to write good quality reviews and categorically summarize different facets of the product. However, despite such efforts, it takes significant effort to skim through thousands of reviews and look for answers that address the queries of consumers. For example, a gamer might be interested in buying a monitor with fast refresh rates and support for G-Sync and FreeSync technologies, while a photographer might be interested in aspects such as color reproduction and Delta-E scores. Therefore, in this paper, we propose a generative aspect summarization model called APSUM that is capable of providing fine-grained summaries of online reviews. To overcome the inherent problem of aspect sparsity, we impose a constraint on the document-topic distribution by introducing a semi-supervised variation of the spike-and-slab prior, and on the word-topic distribution by incorporating a semantic graph. Using a rigorous set of experiments, we show that the proposed model outperforms the state-of-the-art aspect summarization model over a variety of datasets and delivers intuitive fine-grained summaries that can simplify the purchase decisions of consumers.
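The spike-and-slab intuition above can be sketched generatively: a Bernoulli "spike" switches each topic on or off, and a Dirichlet-style "slab" (built from Gamma draws) distributes mass over the active topics only, yielding sparse document-topic vectors. This is the generic prior, not the paper's semi-supervised variant:

```python
import random

random.seed(7)

def sparse_topic_distribution(num_topics, spike_prob, slab_alpha=1.0):
    """Draw a document-topic distribution under a spike-and-slab style prior.

    spike_prob: probability a topic is 'on' for this document (the spike)
    slab_alpha: symmetric Dirichlet concentration over the active topics
    """
    selectors = [1 if random.random() < spike_prob else 0
                 for _ in range(num_topics)]
    if not any(selectors):
        selectors[random.randrange(num_topics)] = 1  # keep at least one topic
    # Dirichlet over active topics via normalized Gamma draws
    gammas = [random.gammavariate(slab_alpha, 1.0) if s else 0.0
              for s in selectors]
    total = sum(gammas)
    return [g / total for g in gammas]
```

With a small `spike_prob`, most entries are exactly zero, which is how the prior combats aspect sparsity dilution in short reviews.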
The pervasive growth of location-based services such as Foursquare and Yelp has enabled researchers to incorporate better personalization into recommendation models by leveraging the geo-temporal breadcrumbs left by a plethora of travelers. In this paper, we explore travel path recommendation, an application of intelligent urban navigation that aims at recommending sequences of points of interest (POIs) to tourists. Currently, travelers rely on a tedious and time-consuming process of searching the web, browsing through websites such as TripAdvisor, and reading travel blogs to compile an itinerary. On the other hand, people who do not plan ahead of their trip find it extremely difficult to do this in real time, since there are no automated systems that can provide personalized itineraries for travelers. To tackle this problem, we propose a tour recommendation model that uses a probabilistic generative framework to incorporate a user's categorical preferences, the influence of their social circle, the dynamic travel transitions (or patterns), and the popularity of venues to recommend sequences of POIs to tourists. Through comprehensive experiments over a rich dataset of travel patterns from Foursquare, we show that our model outperforms the state-of-the-art probabilistic tour recommendation model by providing contextual and meaningful recommendations for travelers.
Crowdfunding has gained widespread popularity by fueling the creative minds of entrepreneurs. Not only has it democratized the funding of startups, it has also bridged the gap between venture capitalists and entrepreneurs by providing a plethora of opportunities for people seeking to invest in new business ventures. Nonetheless, despite the huge success of crowdfunding platforms, not every project reaches its funding goal. One of the main reasons for a project's failure is the difficulty of establishing a linkage between its founders and the investors who are interested in funding such projects. A potential solution to this problem is to develop recommendation systems that suggest suitable projects to crowdfunding investors by capturing their interests. In this paper, we explore Kickstarter, a popular reward-based crowdfunding platform. Being a highly heterogeneous platform, Kickstarter is fuelled by a dynamic community of people who constantly interact with each other before investing in projects. Therefore, the decision to invest in a project depends not only on the preferences of individuals, but also on the influence of the groups a person belongs to and the ongoing status of the projects. We propose a probabilistic recommendation model, called CrowdRec, that recommends Kickstarter projects to a group of investors by incorporating the ongoing status of projects, the personal preferences of individual members, and the collective preference of the group. Using a comprehensive dataset of over 40K crowdfunding groups and 5K projects, we show that our model is effective in recommending projects to groups of Kickstarter users.
Crowdfunding has gained widespread attention in recent years. Despite the huge success of crowdfunding platforms, the percentage of projects that succeed in achieving their desired goal amount is only around 40%. Moreover, many of these crowdfunding platforms follow an "all-or-nothing" policy, meaning the pledged amount is collected only if the goal is reached within a certain predefined time duration. Hence, estimating the probability of success for a project is one of the most important research challenges in the crowdfunding domain. To predict project success, there is a need for new prediction models that can combine the power of both classification (which incorporates both successful and failed projects) and regression (for estimating the time to success). In this paper, we formulate project success prediction as a survival analysis problem and apply the censored regression approach, where one can perform regression in the presence of partial information. We rigorously study the project success time distribution of crowdfunding data and show that the logistic and log-logistic distributions are a natural choice for learning from such data. We investigate various censored regression models using comprehensive data of 18K Kickstarter (a popular crowdfunding platform) projects and 116K corresponding tweets collected from Twitter. We show that models that take full advantage of both the successful and failed projects during the training phase perform significantly better at predicting the success of future projects than those that only use the successful projects. We provide a rigorous evaluation on many sets of relevant features and show that adding a few temporal features obtained in a project's early stages can dramatically improve the performance.
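The censored-regression idea above can be written down directly: a project that succeeds at time t contributes the log-logistic density f(t) to the likelihood, while a failed (censored) project contributes only the survival probability S(t). The parameterization below (scale alpha, shape beta) is the standard log-logistic form, not necessarily the paper's exact one:

```python
import math

def log_logistic_censored_loglik(times, events, alpha, beta):
    """Log-likelihood of censored success times under a log-logistic model.

    times: success time (or censoring time for failed projects) per project
    events: 1 if the project succeeded (event observed), 0 if censored
    alpha: scale parameter; beta: shape parameter.
    S(t) = 1 / (1 + (t/alpha)**beta); f(t) = (beta/alpha)(t/alpha)**(beta-1) * S(t)**2
    """
    ll = 0.0
    for t, e in zip(times, events):
        z = (t / alpha) ** beta
        log_S = -math.log1p(z)  # log survival probability
        if e:
            # observed success: contribute the log density
            ll += math.log(beta / alpha) + (beta - 1) * math.log(t / alpha) + 2 * log_S
        else:
            # censored (failed) project: contribute the log survival probability
            ll += log_S
    return ll
```

Maximizing this over alpha and beta (or over regression coefficients feeding them) uses both successful and failed projects, which is exactly the advantage the abstract reports.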
Crowdfunding has gained widespread popularity in recent years. By funding entrepreneurs with creative minds, it is gradually taking over the role of venture capitalists who provide the much needed seed capital to jump-start business ventures. Despite the huge success of crowdfunding platforms, not every project is successful in reaching its funding goal. Therefore, in this paper, we intend to answer the following question: "what set of features determines a project's success?". We begin by studying the dynamics of Kickstarter, a popular reward-based crowdfunding platform, and the impact of social networks on this platform. Contrary to previous studies, our analysis is not restricted to project-based features alone; instead, we expand the features into four different categories: temporal traits, personal traits, geo-location traits, and network traits. Using a comprehensive dataset of 18K projects and 116K tweets, we provide several unique insights about these features and their effects on the success of Kickstarter projects. Based on these insights, we build a supervised learning framework to learn a model that can recommend a set of investors to Kickstarter projects. By utilizing features from the first three days of project duration alone, we show that our results are significantly better than those of previous studies.
Although topic models designed for textual collections annotated with geographical meta-data have previously been shown to be effective at capturing the vocabulary preferences of people living in different geographical regions, little is known about their utility for information retrieval in general or microblog retrieval in particular. In this work, we propose simple and scalable geographical latent variable generative models and a method to improve the accuracy of retrieval from collections of geo-tagged documents through document expansion based on the topics identified by the proposed models. In particular, we experimentally compare the retrieval effectiveness of four geographical latent variable models: two geographical variants of post-hoc LDA, a latent variable model without hidden topics, and a topic model that can separate background from geographically-specific topics. The experiments conducted on TREC microblog datasets demonstrate significant improvement in the search accuracy of the proposed method over both the traditional probabilistic retrieval model and retrieval models utilizing geographical post-hoc variants of LDA.
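The document-expansion step can be sketched independently of the specific latent variable model: given a document's inferred topic mixture, append the top words of its most probable topics to the document before indexing. The selection heuristic (top-k topics, fixed words per topic) is an illustrative simplification:

```python
def expand_document(doc_tokens, topic_top_words, doc_topic_probs,
                    top_k=2, words_per_topic=3):
    """Expand a geo-tagged document with top words from its likeliest topics.

    topic_top_words: per-topic ranked word lists from a trained model
    doc_topic_probs: the document's inferred topic mixture
    Returns the original tokens plus a few topical expansion words.
    """
    top_topics = sorted(range(len(doc_topic_probs)),
                        key=lambda k: doc_topic_probs[k],
                        reverse=True)[:top_k]
    expansion = [w for k in top_topics
                 for w in topic_top_words[k][:words_per_topic]]
    return doc_tokens + expansion
```

For short microblog posts, this expansion gives the retrieval model additional region-flavored terms to match queries against.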
Lists in social networks have become popular tools to organize content. This paper proposes a novel framework for recommending lists to users by combining several features that jointly capture their personal interests. Our contribution is two-fold. First, we develop a ListRec model that leverages the dynamically varying tweet content, the network of twitterers, and the popularity of lists to collectively model the users' preference towards social lists. Second, we use the topical interests of users and the list network structure to develop a novel network-based model called LIST-PAGERANK. We use this model to recommend auxiliary lists that are more popular than the lists currently subscribed to by the users. We evaluate our ListRec model using a Twitter dataset consisting of 2988 direct list subscriptions. Using an automatic evaluation technique, we compare the performance of the ListRec model with different baseline methods and other competing approaches and show that our model delivers better precision in predicting the subscribed lists of the twitterers. Furthermore, we also demonstrate the importance of combining different weighting schemes and their effect on capturing users' interest towards Twitter lists. To evaluate the LIST-PAGERANK model, we employ a user-study-based evaluation to show that the model is effective in recommending auxiliary lists that are more authoritative than the lists subscribed to by the users.
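As a simplified stand-in for LIST-PAGERANK (which additionally folds in users' topical interests), plain PageRank over a directed list-to-list graph already captures the "more authoritative lists" ranking idea:

```python
def pagerank(adj, damping=0.85, iters=50):
    """Plain PageRank over a directed graph of lists.

    adj: dict mapping each node to its list of out-neighbors.
    Returns a dict of ranks that sums to 1.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}  # teleport mass
        for v in nodes:
            outs = adj[v]
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:
                # dangling node: spread its mass uniformly
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank
```

Lists that are linked to (e.g. subscribed to alongside) many other well-ranked lists accumulate rank, and the top-ranked unsubscribed lists become the auxiliary recommendations.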
Automatic detection of tweets that provide location-specific information would be extremely useful in conveying geo-location-based knowledge to users. However, there is a significant challenge in retrieving such tweets due to the sparsity of geo-tag information, the short textual nature of tweets, and the lack of a pre-defined set of topics. In this paper, we develop a novel framework to identify and summarize tweets that are specific to a location. First, we propose a weighting scheme called Location Centric Word Co-occurrence (LCWC) that uses the content of the tweets and the network information of the twitterers to identify tweets that are location-specific. We evaluate the proposed model using a set of annotated tweets and compare its performance with other weighting schemes studied in the literature. This paper reports three key findings: (a) top trending tweets from a location are poor descriptors of location-specific tweets, (b) ranking tweets purely based on users' geo-locations cannot ascertain the location specificity of tweets, and (c) users' network information plays an important role in determining the location-specific characteristics of tweets. Finally, we train a topic model based on Latent Dirichlet Allocation (LDA) using a large collection of local news data and tweet-based URLs to predict the topics of the location-specific tweets and present them using an interactive web-based interface.
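The spirit of LCWC, combining tweet content with the author's network relation to the location, can be illustrated with a toy score; the actual LCWC weighting differs, and both the counting rule and the network bonus below are assumptions made for illustration:

```python
def lcwc_score(word, tweets, location_terms, author_in_location):
    """Toy location-centric word co-occurrence score.

    Counts how often `word` co-occurs with known location terms in tweets,
    adding a bonus when the tweet's author is networked to the location.
    tweets: list of (author, token_list) pairs
    location_terms: set of terms naming the target location
    author_in_location: dict mapping author -> bool (network signal)
    """
    score = 0.0
    for author, tokens in tweets:
        if word in tokens and any(t in tokens for t in location_terms):
            score += 1.0  # content co-occurrence
            if author_in_location.get(author):
                score += 1.0  # network bonus
    return score
```

This mirrors finding (c): two tweets with identical content can score differently depending on the author's network ties to the location.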