Professor Feldman is a world-leading expert in the field of text mining; in fact, he coined the term “text mining” ...
[Lecture-slide excerpt] Different types of style: scientific papers, newspapers, memos, emails, speech transcripts. Types of document: tables, graphics, small messages vs. books. ... Examples of ambiguous names: “Said” as a person name (male), “Alberta” as a name of a person (female) ...
Product discussion boards are a rich source of information about consumer sentiment toward products, and they are being increasingly exploited. Most sentiment analysis has looked at single products in isolation, but users often compare different products, stating which they like better and why. We present a set of techniques for analyzing how consumers view product markets. Specifically, we extract relative sentiment and comparisons between products to understand which attributes users compare products on, and which products they prefer on each dimension. We illustrate these methods in an extended case study analyzing the sedan car market.
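The comparison-mining idea behind this abstract can be illustrated with a minimal pattern-matching sketch. Everything here is illustrative — the regex pattern, the product names, and the aggregation are assumptions for demonstration, not the authors' actual system:

```python
import re
from collections import defaultdict

# Hypothetical comparative pattern: "<A> is/feels <comparative-adj> than <B>"
PATTERN = re.compile(
    r"(?P<a>[A-Z][\w-]+)\s+(?:is|rides|feels)\s+(?P<attr>\w+er)\s+than\s+(?:the\s+)?(?P<b>[A-Z][\w-]+)"
)

def extract_comparisons(sentences):
    """Return (preferred, attribute, other) triples from comparative sentences."""
    triples = []
    for s in sentences:
        m = PATTERN.search(s)
        if m:
            triples.append((m.group("a"), m.group("attr"), m.group("b")))
    return triples

def preference_counts(triples):
    """Aggregate how often each product 'wins' on each comparison attribute."""
    wins = defaultdict(int)
    for winner, attr, _loser in triples:
        wins[(winner, attr)] += 1
    return dict(wins)

posts = [
    "The Accord is quieter than the Camry on the highway.",
    "Camry feels smoother than Altima in city traffic.",
]
triples = extract_comparisons(posts)
```

A real system would need robust product-name recognition and a far richer set of comparative constructions; the point of the sketch is only the pipeline shape: extract comparison triples, then aggregate per-attribute preferences across the market.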
One of the main problems in building rule-based systems is that the knowledge elicited from experts is not always correct. There is therefore a need for a means of revising the rule base whenever an inaccuracy is discovered. Rule-base revision is the problem of how best to go about revising a deficient rule base using information contained in cases that expose inaccuracies. The revision process is very sensitive to the implicit and explicit biases encoded in the specific revision algorithm employed. In a sense, each revision algorithm must provide two forms of bias: the first governs the preferred location in the rule base for the correction, while the second governs the type of correction performed. In this paper we present FRST (Forward-chaining Revision SysTem), a system for incremental revision of rule bases. The system enables the user to analyze the impact of different revisions and to select the most appropriate revision operator. The user provides t...
Information published in online stock investment message boards, and more recently in stock microblogs, is considered highly valuable by many investors. Previous work focused on aggregation of sentiment from all users. However, in this work we show that it is beneficial to distinguish expert users from non-experts. We propose a general framework for identifying expert investors, and use it as a basis for several models that predict stock rise from stock microblogging messages (stock tweets). In particular, we present two methods that combine expert identification and per-user unsupervised learning. These methods were shown to achieve relatively high precision in predicting stock rise, and significantly outperform our baseline. In addition, our work provides an in-depth analysis of the content and potential usefulness of stock tweets.
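The expert-identification idea can be sketched very simply: weight a user by how often their past bullish calls were followed by an actual rise, and keep only users above a precision threshold. The thresholds and scoring below are illustrative assumptions, not the paper's models:

```python
def user_precision(history):
    """history: list of (predicted_rise, actually_rose) booleans per past tweet."""
    calls = [(p, a) for p, a in history if p]
    if not calls:
        return 0.0
    return sum(1 for _, a in calls if a) / len(calls)

def expert_users(histories, min_calls=10, min_precision=0.6):
    """Keep users whose bullish calls were right often enough (thresholds illustrative)."""
    experts = set()
    for user, history in histories.items():
        bullish = [h for h in history if h[0]]
        if len(bullish) >= min_calls and user_precision(history) >= min_precision:
            experts.add(user)
    return experts

histories = {
    "alice": [(True, True)] * 9 + [(True, False)] * 3,  # 12 bullish calls, 9 correct
    "bob":   [(True, True)] * 5 + [(True, False)] * 7,  # 12 bullish calls, 5 correct
    "carol": [(True, True)] * 3,                        # too few calls to judge
}
```

Requiring a minimum number of calls before trusting a user's precision is the key design point: it avoids anointing someone an expert on the strength of one or two lucky tweets.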
This study explores whether the Management Discussion and Analysis (MD&A) section of Forms 10-Q and 10-K has incremental information content beyond financial measures such as earnings surprises, accruals, and operating cash flows (OCF). It uses a well-established classification scheme of words into positive and negative categories to measure the tone and sentiment of a specific MD&A section as compared to those of the prior four filings. Our results indicate that short-window market reactions around the SEC filing are significantly associated with the tone of the MD&A section, even after controlling for accruals, OCF, and earnings surprises. We also show that the tone of the MD&A section adds significantly to portfolio drift returns in the window from two days after the SEC filing date through one day after the subsequent quarter's preliminary earnings announcement, beyond the financial information conveyed by accruals, OCF, and earnings surprises.
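The lexicon-based tone measure described here can be sketched in a few lines. The tiny word lists below are placeholders — real studies use established positive/negative word lists — and the relative-to-prior-filings comparison follows the abstract's description only in spirit:

```python
# Tiny illustrative lexicons; actual studies use established word lists.
POSITIVE = {"improve", "growth", "strong", "favorable", "gain"}
NEGATIVE = {"decline", "loss", "weak", "adverse", "impairment"}

def tone(text):
    """(positive - negative) / total words: a common lexicon-based tone score."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w.strip(".,") in POSITIVE for w in words)
    neg = sum(w.strip(".,") in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def tone_change(current_mda, prior_mdas):
    """Tone of the current MD&A relative to the average tone of prior filings."""
    prior = [tone(t) for t in prior_mdas]
    baseline = sum(prior) / len(prior) if prior else 0.0
    return tone(current_mda) - baseline
```

Benchmarking the current filing against the same firm's prior filings, rather than using the raw tone, is what isolates the *change* in management's language — the quantity the market reaction is measured against.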
Many errors produced by unsupervised and semi-supervised relation extraction (RE) systems occur because of wrong recognition of the entities that participate in the relations. This is especially true for systems that do not use separate named-entity recognition components, relying instead on general-purpose shallow parsing. Such systems have greater applicability, because they are able to extract relations that contain attributes of unknown types. However, this generality comes at a cost in accuracy. In this paper we show how to use corpus statistics to validate and correct the arguments of extracted relation instances, improving the overall RE performance. We test the methods on SRES, a self-supervised Web relation extraction system. We also compare the performance of the corpus-based methods to that of validation and correction methods based on supervised NER components.
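One way to picture corpus-statistics argument correction is boundary trimming: if shallow parsing produced the over-long argument "CEO John Smith", corpus counts of candidate sub-spans can suggest that "John Smith" is the better-supported entity span. The scoring below is a hypothetical illustration, not SRES's actual statistic:

```python
from collections import Counter

def best_argument_span(candidate_tokens, corpus_counts):
    """Pick the sub-span of a candidate argument best supported by the corpus.

    corpus_counts maps token tuples to how often they occur as standalone
    capitalized phrases in the corpus (an illustrative statistic). Falls back
    to the full candidate when no sub-span has corpus support.
    """
    best, best_score = tuple(candidate_tokens), 0
    n = len(candidate_tokens)
    for i in range(n):
        for j in range(i + 1, n + 1):
            span = tuple(candidate_tokens[i:j])
            score = corpus_counts.get(span, 0) * len(span)  # favor longer supported spans
            if score > best_score:
                best, best_score = span, score
    return list(best)

counts = Counter({("John", "Smith"): 40, ("Smith",): 15, ("CEO",): 5})
```

Weighting by span length is one simple way to prefer "John Smith" over the more frequent but less informative single token "Smith"; a real system would use a better-calibrated score.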
This paper describes a framework for defining domain-specific Feature Functions in a user-friendly form for use in a Maximum Entropy Markov Model (MEMM) for the Named Entity Recognition (NER) task. Our system, called MERGE, allows defining general Feature Function Templates, as well as Linguistic Rules incorporated into the classifier. We show a simple way of translating these rules into specific feature functions. We show that, with a small amount of expert rule-tuning, MERGE can outperform both purely machine-learning-based systems and purely knowledge-based approaches.
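A feature-function template of the kind described can be thought of as a parameterized factory: the user supplies a lexical pattern and a previous-label constraint, and the template instantiates a binary feature the MEMM can weight. The names and the specific template form below are assumptions for illustration, not MERGE's actual syntax:

```python
import re

def make_feature(pattern, prev_tag):
    """Feature-function template: the instantiated feature fires (returns 1) when
    the current token matches the user-supplied regex AND the previous label
    equals prev_tag. Pattern and tag names here are illustrative."""
    compiled = re.compile(pattern)
    def feature(token, previous_label):
        return 1 if compiled.fullmatch(token) and previous_label == prev_tag else 0
    return feature

# A rule-like instantiation: a capitalized word continuing a PERSON mention.
f_person_cont = make_feature(r"[A-Z][a-z]+", "PERSON")
```

This is how a linguistic rule ("a capitalized word after a person name is probably part of the name") becomes an ordinary weighted feature rather than a hard constraint: the classifier learns how much to trust it.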
One of the main problems in building expert systems is that the knowledge elicited from experts tends to be only approximately correct. We therefore need a means of revising our knowledge base whenever we discover such an inaccuracy. The theory revision problem is the problem of how best to go about revising a deficient knowledge base using information contained in examples that expose inaccuracies. In this thesis we present our approach to the theory revision problem for propositional domain theories. Our approach, which we call PTR (Probabilistic Theory Revision), is based on an underlying mathematical model and therefore exhibits several useful properties which other theory revision algorithms do not have.
The information age is characterized by a rapid growth in the amount of information available in electronic media. Traditional data handling methods are not adequate to cope with this flood of information. Knowledge discovery in databases (KDD) is a new paradigm that focuses on automatic or semiautomatic exploration of large amounts of data and on discovery of relevant and interesting patterns within them. While most work on KDD is concerned with structured databases, it is clear that this paradigm is required for handling the huge amount of information that is available only in unstructured textual form. To apply KDD on texts, it is necessary to impose some structure on the data that would be rich enough to allow for interesting KDD operations. On the other hand, we must consider the severe limitations of current text processing technology and define rather simple structures that can be extracted from texts fairly automatically and at a reasonable cost. One of the options is to use...
A basic tenet of financial economics is that asset prices change in response to unexpected fundamental information. Since Roll's (1988) provocative presidential address that showed little relation between stock prices and news, however, the finance literature has had limited success reversing this finding. This paper revisits the topic in a novel way. Using advancements in the area of textual analysis, we are better able to identify relevant news, both by type and by tone. Once news is correctly identified in this manner, there is considerably more evidence of a strong relationship between stock price changes and information. For example, market model R-squareds are no longer the same on news versus no-news days (i.e., Roll's (1988) infamous result), but now are 16% versus 33%; variance ratios of returns on identified-news versus no-news days are 120% higher, versus only 20% for unidentified-news versus no-news days; and, conditional on extreme moves, stock price reversals occur on no-news days, while identified-news days show an opposite effect, namely a strong degree of continuation. A number of these results are strengthened further when the tone of the news is taken into account by measuring the positive/negative sentiment of the news story.
The availability of online text documents exposes readers to a vast amount of potentially valuable information buried in those texts. The huge number of documents has created a pressing need for automated methods of discovering relevant information without having to read it all. Information Extraction (IE) from documents is one of the approaches in text mining; it extracts features (entities) from documents.
KDD-98 Organization. General Conference Chair: Gregory Piatetsky-Shapiro, Knowledge Stream Partners. Program Cochairs: Rakesh Agrawal, IBM Almaden Research Center; Paul Stolorz, Jet Propulsion Laboratory. Publicity Chair: Foster Provost, Bell Atlantic Science and Technology. Tutorial Chair: Padhraic Smyth, University of California, Irvine. Panel Chair: Willi Kloesgen, GMD, Germany. Workshops Chair: Ronny Kohavi, Silicon Graphics. Exhibits Chair: Ismail Parsa, Epsilon ...
Text Mining is the automatic discovery of new, previously unknown information, by automatic analysis of various textual resources. Text mining starts by extracting facts and events from textual sources and then enables forming new hypotheses that are further explored by traditional Data Mining and data analysis methods. In this chapter we will define text mining and describe the three main approaches for performing information extraction. In addition, we will describe how we can visually display and analyze the outcome of the information extraction process.
... Joshua Livnat, Professor of Accounting, Stern School of Business, New York University ... uses WSJ and Dow Jones News Service (DJNS) columns about S&P 500 firms to predict future ...
ABSTRACT Document Explorer is a data mining system for document collections. Such a collection represents an application domain, and the primary goal of the system is to derive patterns that provide knowledge about this domain. Additionally, the derived patterns can be used to browse the collection. Document Explorer searches for patterns that capture relations between concepts of the domain. The patterns that have been verified as interesting are structured and presented in a visual user interface allowing the user to operate on the results to refine and redirect mining queries or to access the associated documents. The system offers preprocessing tools to construct or refine a knowledge base of domain concepts and to create an intermediate representation of the document collection that will be used by all subsequent data mining operations. The main pattern types the system can search for are frequent sets, associations (see Chapter 16.2.3 of this handbook), concept distributions, and keyword graphs.
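The frequent-set pattern type mentioned in the abstract can be illustrated with a minimal counting pass over concept-annotated documents. This is a toy Apriori-style enumeration with made-up documents and thresholds, not Document Explorer's actual mining engine:

```python
from itertools import combinations
from collections import Counter

def frequent_concept_sets(doc_concepts, min_support=2, max_size=2):
    """Return concept sets that co-occur in at least min_support documents.

    doc_concepts: list of per-document concept lists (illustrative annotations).
    Enumerates all subsets up to max_size; fine for a sketch, too slow at scale.
    """
    counts = Counter()
    for concepts in doc_concepts:
        unique = sorted(set(concepts))  # dedupe so one doc counts once per set
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

docs = [
    ["merger", "IBM", "Lotus"],
    ["IBM", "Lotus", "acquisition"],
    ["merger", "IBM"],
]
result = frequent_concept_sets(docs)
```

In a system like the one described, such frequent sets would then be filtered by interestingness measures and presented in the visual interface, with each set linked back to the documents that support it.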

And 112 more