An empirical evaluation of social influence metrics

154790-Thumbnail Image.png
Description
Predicting when an individual will adopt a new behavior is an important problem in application domains such as marketing and public health. This thesis examines the performance of a wide variety of social network based measurements proposed in the

Predicting when an individual will adopt a new behavior is an important problem in application domains such as marketing and public health. This thesis examines the performance of a wide variety of social network based measurements proposed in the literature - which have not been previously compared directly. This research studies the probability of an individual becoming influenced based on measurements derived from neighborhood (i.e. number of influencers, personal network exposure), structural diversity, locality, temporal measures, cascade measures, and metadata. It also examines the ability to predict influence based on choice of the classifier and how the ratio of positive to negative samples in both training and testing affect prediction results - further enabling practical use of these concepts for social influence applications.
Date Created
2016
Agent

Web Intelligence for Scaling Discourse of Organizations

154774-Thumbnail Image.png
Description
Internet and social media devices created a new public space for debate on political

and social topics (Papacharissi 2002; Himelboim 2010). Hotly debated issues

span all spheres of human activity; from liberal vs. conservative politics, to radical

vs. counter-radical religious debate, to climate

Internet and social media devices created a new public space for debate on political

and social topics (Papacharissi 2002; Himelboim 2010). Hotly debated issues

span all spheres of human activity; from liberal vs. conservative politics, to radical

vs. counter-radical religious debate, to climate change debate in scientific community,

to globalization debate in economics, and to nuclear disarmament debate in

security. Many prominent ’camps’ have emerged within Internet debate rhetoric and

practice (Dahlberg, n.d.).

In this research I utilized feature extraction and model fitting techniques to process

the rhetoric found in the web sites of 23 Indonesian Islamic religious organizations,

later with 26 similar organizations from the United Kingdom to profile their

ideology and activity patterns along a hypothesized radical/counter-radical scale, and

presented an end-to-end system that is able to help researchers to visualize the data

in an interactive fashion on a time line. The subject data of this study is the articles

downloaded from the web sites of these organizations dating from 2001 to 2011,

and in 2013. I developed algorithms to rank these organizations by assigning them

to probable positions on the scale. I showed that the developed Rasch model fits

the data using Andersen’s LR-test (likelihood ratio). I created a gold standard of

the ranking of these organizations through an expertise elicitation tool. Then using

my system I computed expert-to-expert agreements, and then presented experimental

results comparing the performance of three baseline methods to show that the

Rasch model not only outperforms the baseline methods, but it was also the only

system that performs at expert-level accuracy.

I developed an end-to-end system that receives list of organizations from experts,

mines their web corpus, prepare discourse topic lists with expert support, and then

ranks them on scales with partial expert interaction, and finally presents them on an

easy to use web based analytic system.
Date Created
2016
Agent

Directional prediction of stock prices using breaking news on Twitter

154769-Thumbnail Image.png
Description
Stock market news and investing tips are popular topics in Twitter. In this dissertation, first I utilize a 5-year financial news corpus comprising over 50,000 articles collected from the NASDAQ website matching the 30 stock symbols in Dow Jones Index

Stock market news and investing tips are popular topics in Twitter. In this dissertation, first I utilize a 5-year financial news corpus comprising over 50,000 articles collected from the NASDAQ website matching the 30 stock symbols in Dow Jones Index (DJI) to train a directional stock price prediction system based on news content. Next, I proceed to show that information in articles indicated by breaking Tweet volumes leads to a statistically significant boost in the hourly directional prediction accuracies for the DJI stock prices mentioned in these articles. Secondly, I show that using document-level sentiment extraction does not yield a statistically significant boost in the directional predictive accuracies in the presence of other 1-gram keyword features. Thirdly I test the performance of the system on several time-frames and identify the 4 hour time-frame for both the price charts and for Tweet breakout detection as the best time-frame combination. Finally, I develop a set of price momentum based trade exit rules to cut losing trades early and to allow the winning trades run longer. I show that the Tweet volume breakout based trading system with the price momentum based exit rules not only improves the winning accuracy and the return on investment, but it also lowers the maximum drawdown and achieves the highest overall return over maximum drawdown.
Date Created
2016
Agent

Sentiment analysis for long-term stock prediction

154756-Thumbnail Image.png
Description
There have been extensive research in how news and twitter feeds can affect the outcome of a given stock. However, a majority of this research has studied the short term effects of sentiment with a given stock price.

There have been extensive research in how news and twitter feeds can affect the outcome of a given stock. However, a majority of this research has studied the short term effects of sentiment with a given stock price. Within this research, I studied the long-term effects of a given stock price using fundamental analysis techniques. Within this research, I collected both sentiment data and fundamental data for Apple Inc., Microsoft Corp., and Peabody Energy Corp. Using a neural network algorithm, I found that sentiment does have an effect on the annual growth of these companies but the fundamentals are more relevant when determining overall growth. The stocks which show more consistent growth hold more importance on the previous year’s stock price but companies which have less consistency in their growth showed more reliance on the revenue growth and sentiment on the overall company and CEO. I discuss how I collected my research data and used a multi-layered perceptron to predict a threshold growth of a given stock. The threshold used for this particular research was 10%. I then showed the prediction of this threshold using my perceptron and afterwards, perform an f anova test on my choice of features. The results showed the fundamentals being the better predictor of stock information but fundamentals came in a close second in several cases, proving sentiment does hold an effect over long term growth.
Date Created
2016
Agent

A timeline extraction approach to derive drug usage patterns in pregnant women using social media

154641-Thumbnail Image.png
Description
Proliferation of social media websites and discussion forums in the last decade has resulted in social media mining emerging as an effective mechanism to extract consumer patterns. Most research on social media and pharmacovigilance have concentrated on

Adverse Drug Reaction (ADR)

Proliferation of social media websites and discussion forums in the last decade has resulted in social media mining emerging as an effective mechanism to extract consumer patterns. Most research on social media and pharmacovigilance have concentrated on

Adverse Drug Reaction (ADR) identification. Such methods employ a step of drug search followed by classification of the associated text as consisting an ADR or not. Although this method works efficiently for ADR classifications, if ADR evidence is present in users posts over time, drug mentions fail to capture such ADRs. It also fails to record additional user information which may provide an opportunity to perform an in-depth analysis for lifestyle habits and possible reasons for any medical problems.

Pre-market clinical trials for drugs generally do not include pregnant women, and so their effects on pregnancy outcomes are not discovered early. This thesis presents a thorough, alternative strategy for assessing the safety profiles of drugs during pregnancy by utilizing user timelines from social media. I explore the use of a variety of state-of-the-art social media mining techniques, including rule-based and machine learning techniques, to identify pregnant women, monitor their drug usage patterns, categorize their birth outcomes, and attempt to discover associations between drugs and bad birth outcomes.

The technique used models user timelines as longitudinal patient networks, which provide us with a variety of key information about pregnancy, drug usage, and post-

birth reactions. I evaluate the distinct parts of the pipeline separately, validating the usefulness of each step. The approach to use user timelines in this fashion has produced very encouraging results, and can be employed for a range of other important tasks where users/patients are required to be followed over time to derive population-based measures.
Date Created
2016
Agent

Predicting demographic and financial attributes in a bank marketing dataset

154589-Thumbnail Image.png
Description
Bank institutions employ several marketing strategies to maximize new customer acquisition as well as current customer retention. Telemarketing is one such approach taken where individual customers are contacted by bank representatives with offers. These telemarketing strategies can be

Bank institutions employ several marketing strategies to maximize new customer acquisition as well as current customer retention. Telemarketing is one such approach taken where individual customers are contacted by bank representatives with offers. These telemarketing strategies can be improved in combination with data mining techniques that allow predictability of customer information and interests. In this thesis, bank telemarketing data from a Portuguese banking institution were analyzed to determine predictability of several client demographic and financial attributes and find most contributing factors in each. Data were preprocessed to ensure quality, and then data mining models were generated for the attributes with logistic regression, support vector machine (SVM) and random forest using Orange as the data mining tool. Results were analyzed using precision, recall and F1 score.
Date Created
2016
Agent

Visual analytics for spatiotemporal cluster analysis

154403-Thumbnail Image.png
Description
Traditionally, visualization is one of the most important and commonly used methods of generating insight into large scale data. Particularly for spatiotemporal data, the translation of such data into a visual form allows users to quickly see patterns, explore summaries

Traditionally, visualization is one of the most important and commonly used methods of generating insight into large scale data. Particularly for spatiotemporal data, the translation of such data into a visual form allows users to quickly see patterns, explore summaries and relate domain knowledge about underlying geographical phenomena that would not be apparent in tabular form. However, several critical challenges arise when visualizing and exploring these large spatiotemporal datasets. While, the underlying geographical component of the data lends itself well to univariate visualization in the form of traditional cartographic representations (e.g., choropleth, isopleth, dasymetric maps), as the data becomes multivariate, cartographic representations become more complex. To simplify the visual representations, analytical methods such as clustering and feature extraction are often applied as part of the classification phase. The automatic classification can then be rendered onto a map; however, one common issue in data classification is that items near a classification boundary are often mislabeled.

This thesis explores methods to augment the automated spatial classification by utilizing interactive machine learning as part of the cluster creation step. First, this thesis explores the design space for spatiotemporal analysis through the development of a comprehensive data wrangling and exploratory data analysis platform. Second, this system is augmented with a novel method for evaluating the visual impact of edge cases for multivariate geographic projections. Finally, system features and functionality are demonstrated through a series of case studies, with key features including similarity analysis, multivariate clustering, and novel visual support for cluster comparison.
Date Created
2016
Agent

Locality sensitive indexing for efficient high-dimensional query answering in the presence of excluded regions

154272-Thumbnail Image.png
Description
Similarity search in high-dimensional spaces is popular for applications like image

processing, time series, and genome data. In higher dimensions, the phenomenon of

curse of dimensionality kills the effectiveness of most of the index structures, giving

way to approximate methods like Locality Sensitive

Similarity search in high-dimensional spaces is popular for applications like image

processing, time series, and genome data. In higher dimensions, the phenomenon of

curse of dimensionality kills the effectiveness of most of the index structures, giving

way to approximate methods like Locality Sensitive Hashing (LSH), to answer similarity

searches. In addition to range searches and k-nearest neighbor searches, there

is a need to answer negative queries formed by excluded regions, in high-dimensional

data. Though there have been a slew of variants of LSH to improve efficiency, reduce

storage, and provide better accuracies, none of the techniques are capable of

answering queries in the presence of excluded regions.

This thesis provides a novel approach to handle such negative queries. This is

achieved by creating a prefix based hierarchical index structure. First, the higher

dimensional space is projected to a lower dimension space. Then, a one-dimensional

ordering is developed, while retaining the hierarchical traits. The algorithm intelligently

prunes the irrelevant candidates while answering queries in the presence of

excluded regions. While naive LSH would need to filter out the negative query results

from the main results, the new algorithm minimizes the need to fetch the redundant

results in the first place. Experiment results show that this reduces post-processing

cost thereby reducing the query processing time.
Date Created
2016
Agent

Multi-variate time series similarity measures and their robustness against temporal asynchrony

154174-Thumbnail Image.png
Description
The amount of time series data generated is increasing due to the integration of sensor technologies with everyday applications, such as gesture recognition, energy optimization, health care, video surveillance. The use of multiple sensors simultaneously

for capturing different aspects of the

The amount of time series data generated is increasing due to the integration of sensor technologies with everyday applications, such as gesture recognition, energy optimization, health care, video surveillance. The use of multiple sensors simultaneously

for capturing different aspects of the real world attributes has also led to an increase in dimensionality from uni-variate to multi-variate time series. This has facilitated richer data representation but also has necessitated algorithms determining similarity between two multi-variate time series for search and analysis.

Various algorithms have been extended from uni-variate to multi-variate case, such as multi-variate versions of Euclidean distance, edit distance, dynamic time warping. However, it has not been studied how these algorithms account for asynchronous in time series. Human gestures, for example, exhibit asynchrony in their patterns as different subjects perform the same gesture with varying movements in their patterns at different speeds. In this thesis, we propose several algorithms (some of which also leverage metadata describing the relationships among the variates). In particular, we present several techniques that leverage the contextual relationships among the variates when measuring multi-variate time series similarities. Based on the way correlation is leveraged, various weighing mechanisms have been proposed that determine the importance of a dimension for discriminating between the time series as giving the same weight to each dimension can led to misclassification. We next study the robustness of the considered techniques against different temporal asynchronies, including shifts and stretching.

Exhaustive experiments were carried on datasets with multiple types and amounts of temporal asynchronies. It has been observed that accuracy of algorithms that rely on data to discover variate relationships can be low under the presence of temporal asynchrony, whereas in case of algorithms that rely on external metadata, robustness against asynchronous distortions tends to be stronger. Specifically, algorithms using external metadata have better classification accuracy and cluster separation than existing state-of-the-art work, such as EROS, PCA, and naive dynamic time warping.
Date Created
2015
Agent

Self-configuring and self-adaptive environment control systems for buildings

154084-Thumbnail Image.png
Description
Lighting systems and air-conditioning systems are two of the largest energy consuming end-uses in buildings. Lighting control in smart buildings and homes can be automated by having computer controlled lights and window blinds along with illumination sensors that are distributed

Lighting systems and air-conditioning systems are two of the largest energy consuming end-uses in buildings. Lighting control in smart buildings and homes can be automated by having computer controlled lights and window blinds along with illumination sensors that are distributed in the building, while temperature control can be automated by having computer controlled air-conditioning systems. However, programming actuators in a large-scale environment for buildings and homes can be time consuming and expensive. This dissertation presents an approach that algorithmically sets up the control system that can automate any building without requiring custom programming. This is achieved by imbibing the system self calibrating and self learning abilities.

For lighting control, the dissertation describes how the problem is non-deterministic polynomial-time hard(NP-Hard) but can be resolved by heuristics. The resulting system controls blinds to ensure uniform lighting and also adds artificial illumination to ensure light coverage remains adequate at all times of the day, while adjusting for weather and seasons. In the absence of daylight, the system resorts to artificial lighting.

For temperature control, the dissertation describes how the temperature control problem is modeled using convex quadratic programming. The impact of every air conditioner on each sensor at a particular time is learnt using a linear regression model. The resulting system controls air-conditioning equipments to ensure the maintenance of user comfort and low cost of energy consumptions. The system can be deployed in large scale environments. It can accept multiple target setpoints at a time, which improves the flexibility and efficiency of cooling systems requiring temperature control.

The methods proposed work as generic control algorithms and are not preprogrammed for a particular place or building. The feasibility, adaptivity and scalability features of the system have been validated through various actual and simulated experiments.
Date Created
2015
Agent