Each paper will be 6-8 double spaced pages and be based on a journal article or issue of the student’s choice. If you use an online source, check that it provides credible information.

The discussion papers should summarize and analyze the main points put forward by the author(s). Do you agree or disagree with the author, and what is the basis for your position? Exceptional work would include additional research and thoughtful synthesis of the authors’ ideas with your ideas. You should choose an article that focuses on financial management and/or accounting content.

Leveraging Financial Social Media Data for Corporate Fraud Detection

WEI DONG, SHAOYI LIAO, AND ZHONGJU ZHANG

WEI DONG (weidong1@mail.ustc.edu.cn) is a Ph.D. candidate in management science and engineering at the School of Management, University of Science and Technology of China. He is in a joint doctoral program with City University of Hong Kong. His research interests include social media, text mining, and business intelligence. He has published in European Journal of Operational Research.

SHAOYI LIAO (issliao@cityu.edu.hk) is a professor in the Department of Information Systems, City University of Hong Kong. He obtained his Ph.D. in information systems from Aix-Marseille University, France. His research is focused on artificial intelligence, business intelligence, and social media analytics. He has published in MIS Quarterly, INFORMS Journal on Computing, Decision Support Systems, and ACM Transactions on Management Information Systems, among others.

ZHONGJU ZHANG (Zhongju.Zhang@asu.edu; corresponding author) is codirector of the Actionable Analytics Lab and an associate professor of information systems at the W. P. Carey School of Business, Arizona State University. His research focuses on how information technology and data analytics impact consumer behavior and decision making, create business value, and transform business models. His work has appeared in leading academic journals including Information Systems Research, Journal of Management Information Systems, MIS Quarterly, Production and Operations Management, INFORMS Journal on Computing, and others. He has won numerous research and teaching awards.

ABSTRACT: Corporate fraud can lead to significant financial losses and cause immeasurable damage to investor confidence and the overall economy. Detection of such frauds is a time-consuming and challenging task. Traditionally, researchers have relied on financial data and/or textual content from financial statements to detect corporate fraud. Guided by systemic functional linguistics (SFL) theory, we propose an analytic framework that taps into unstructured data from financial social media platforms to assess the risk of corporate fraud. We assemble a unique data set including 64 fraudulent firms and a matched sample of 64 nonfraudulent firms, as well as the social media data prior to each firm’s alleged fraud violation in Accounting and Auditing Enforcement Releases (AAERs). Our framework automatically extracts signals such as sentiment features, emotion features, topic features, lexical features, and social network features, which are then fed into machine learning classifiers for fraud detection. We evaluate and compare the performance of our algorithm against baseline approaches using only financial ratios and language-based features, respectively. We further validate the robustness of our algorithm by detecting leaked information and rumors, testing the algorithm on a new data set, and conducting an applicability check. Our results demonstrate the value of financial social media data and serve as a proof of concept of using such data to complement traditional fraud detection methods.

Journal of Management Information Systems / 2018, Vol. 35, No. 2, pp. 461-487. Copyright © Taylor & Francis Group, LLC. ISSN 0742-1222 (print) / ISSN 1557-928X (online). DOI: https://doi.org/10.1080/07421222.2018.1451954

KEY WORDS AND PHRASES: corporate fraud, financial social media, fraud detection, social media platform, systemic functional linguistics theory, text analytics.

Financial fraud is a serious commercial problem worldwide and takes many different forms, including corporate fraud, securities and commodities fraud, health-care fraud, financial institution fraud, mortgage fraud, and others [24]. Corporate fraud continues to be one of the FBI’s highest criminal priorities and is defined as a “deliberate fraud committed by management that injures investors and creditors through misleading financial statements” [22, p. 28]. Even though the number of corporate fraud cases is relatively small compared with other kinds of fraud, the financial losses associated with corporate fraud can be devastating once it happens.

For example, the Enron scandal cost shareholders $74 billion, while the WorldCom fraud led to 30,000 lost jobs and $180 billion in losses for investors. In addition to the tremendous financial losses, corporate fraud also has the potential to cause immeasurable damage to the overall economy and investor confidence. Therefore, corporate fraud risk assessment and detection before the U.S. Securities and Exchange Commission (SEC) disclosure have received significant attention from both practitioners and academic researchers.

Existing analytical procedures for corporate fraud investigation rely heavily on auditing accountants and regulators, who analyze complex financial records and documents including financial statements. Financial statements, however, can be out of date upon release. They usually discuss a company’s past operations and performance for the previous quarter, if not earlier. While useful to identify fraudulent and nonfraudulent activities, data contained in a financial statement are often not appropriate to detect corporate fraud in a timely manner. According to Liou [40], approaches using financial statements often result in an average time lag of around three years from fraud inception to detection of the fraud. Additionally, financial statements may contain misleading and fictitious information; thus, further in-depth research is needed to assess the validity and risks of material misstatement in these documents [34].

In recent years, financial social media platforms for investment research have burgeoned. In addition to the fundamental research and broad in-depth coverage of various equities, these platforms allow registered users to participate in discussions, offer insights and alternative perspectives, and point out risks or flaws through an interactive forum/commentary mechanism. The user base of the platform is diverse.

Besides investors, industry experts, and financial analysts, the platform also has a large readership including money managers, business leaders, journalists, bloggers, and the public. The opinions and views expressed in the analysis and discussions on these platforms have been shown to contain value-relevant information and have been used to predict future stock returns and earnings surprises [17]. Dyck et al. [21] also recognize the power of nontraditional players (such as employees, media outlets, and public investors) as whistleblowers about a potential violation of federal laws and regulations that has occurred, is ongoing, or is about to occur.

We believe the user-generated content (UGC) on financial social media platforms can be useful in assessing the potential risk of corporate fraud. Anecdotal evidence seems to support this idea. Take NQ Mobile (a Chinese mobile security company) as an example. Muddy Waters Research (a market research and short-selling firm) released a harsh assessment citing “a massive fraud” at NQ Mobile on October 24, 2013. The news led to a 47 percent drop in the NQ stock price overnight. It is, however, worth noting that a user named “kankan123” on Xueqiu (a major Chinese social media platform for financial investors) had released a series of analysis reports questioning NQ Mobile’s fraudulent behavior at the beginning of 2013 (detailed information can be found at http://xueqiu.com/S/NQ/25820468), more than six months before the report from Muddy Waters Research. An analytic framework that takes advantage of such UGC can thus help audit firms, government regulators, securities agencies, and investors achieve their strategic goals by providing an early and effective fraud detection algorithm to protect public interests. These stakeholders can leverage the framework to better estimate the fraud risk associated with each target firm so as to make informed decisions, such as minimizing exposure to fraudulent firms and determining how to allocate resources to investigate target firms.

In this study, we seek to examine how to extract useful features from UGC on social media platforms and develop a text analytic framework to automatically detect corporate fraud. Our framework is grounded in systemic functional linguistics (SFL) theory [29], which provides the foundation for our feature sets, such as sentiment features, emotion features, topic features, lexical weight features, and social network features. We evaluate the performance of our algorithm using data from two platforms: SeekingAlpha and Yahoo Finance. Our extensive analyses demonstrate the efficacy of our algorithm as well as the leading effects of social media content on early corporate fraud detection. Additionally, we validate the practical contributions and implications of our algorithm by conducting an applicability check with four focus groups (each with three domain knowledge experts). The applicability check shows that major stakeholders in the financial industry consider our approach a helpful auxiliary tool, providing further evidence of the value of our framework. To the best of our knowledge, this is one of the first studies to use textual data from social media platforms for corporate fraud detection.

Literature Review

Financial fraud and fraud detection have been important topics in both the accounting and the finance literature, but are relatively understudied in the information systems literature. Here, we review the existing literature on different types of financial fraud and the various methods that have been proposed to detect it. We highlight what might be missing and what we can add to further improve the existing methods.

Financial fraud can happen at both the firm and community levels. Community-level fraud usually involves a focal firm as well as external parties (such as customers and/or clients) related to the firm [54]. At the firm level, Ngai et al. [49] provided a detailed literature review on detecting financial fraud via data mining methods. Among different types of financial fraud, corporate fraud consists of activities undertaken by an individual (usually top management) to deliberately mislead investors, creditors, and the public so as to gain an unfair advantage. When facing market-driven pressures due to predicament or asset misappropriation because of personal affairs, firm managers tend to “overstate assets, sales and profit, or understate liabilities, expenses or losses” [69, p. 5519] and disclose unreal growth opportunities in financial statements. With these conceited misrepresentations, Wall Street analysts and public investors will raise their expectations and earnings projections about this company. Likewise, in order to meet the new expectations and projections of the market, the management will have to make another misleading financial statement for the next quarter or fiscal year. This cycle constitutes the business process map of corporate fraud.

In order to improve financial reporting, the American Institute of Certified Public Accountants has established standard accounting principles. To fight corporate fraud, several auditing guidelines have been issued that auditors need to consider in identifying the risk of material misstatement in financial statements. Typically, financial statements for a business include income statements, balance sheets, statements of retained earnings, and cash flows. These contents can be broadly classified into structured data (e.g., numeric financial variables and ratios, quantitative descriptions of operating conditions) and unstructured data (e.g., management discussion and analysis). Table 1 presents a sample of representative studies that use structured and unstructured data for corporate fraud detection. These studies are conventional fraud detection methods that tap into only traditional data sources such as financial statements and earnings conference calls.

From a methodology perspective, conventional auditing practices rely primarily on statistical analysis of structured financial data [15]. Kaminski et al. [34], however, argued that financial ratios provide limited ability to detect fraud because management can create fictitious numbers. Hence some researchers, for example, Brazel et al. [10], examined the effectiveness of nonfinancial variables on the risk of corporate fraud. With the development of natural language processing (NLP) techniques, researchers have begun to glean the textual contents and signals from financial statements and examine whether they can provide additional sources of information to predict fraud. Table 2 summarizes a list of studies that seek to detect fraud using text mining techniques (either a rule-based dictionary approach or a statistical approach) [38]. Most of the studies in Table 2 used the management discussion and analysis (MD&A) section of financial reports, which is usually well written by a firm’s management team in formal business language. Additionally, the MD&A section has a fairly rigid content structure, such as discussion of financial conditions, results of operations, and forward-looking statements for the company.

Data sources such as financial statements and earnings conference calls are usually well planned and prepared in advance. Liou [40] found that detection methods using financial statements tend to result in a time lag from fraud inception to detection.

More importantly, financial statements and earnings conference calls do not capture opinions and insights from other stakeholders such as public investors and analysts.

Dyck et al. [21] and Cecchini et al. [16] argued for the potential strategic value of using this new source of information—user-generated content from employees, media outlets, and public investors—to predict corporate fraud. Financial social media platforms such as SeekingAlpha are natural venues that aggregate such user-generated content online, and thus merit further study to examine their impacts on corporate fraud detection. It should be noted that the process of analyzing unstructured textual data from financial social media platforms is drastically different from that using only the structured data and/or the MD&A section from financial statements. Dictionary-based text analysis methods would also not be appropriate here, since the contributors of this social media content are not the management team of a firm; it is therefore difficult to construct a context-aware dictionary.

Table 1. Representative Studies of Corporate Fraud Detection and Data Sources

Structured data:
  Numerical financial variables: Cecchini et al. [15] (financial statements)
  Financial ratios: Summers and Sweeney [64]; Dechow et al. [19]; Abbasi et al. [1] (financial statements)
  Nonfinancial variables: Brazel et al. [10] (financial statements)
Unstructured data:
  Features from language-based textual content: Larcker and Zakolyukina [36] (earnings conference calls); Purda and Skillicorn [55] (MD&A section from financial statements)
  Features from vocal speech: Hobson et al. [30] (earnings conference calls)
  Social media features: current study (financial social media platform, for example, SeekingAlpha)

Table 2. Text-Based Methods of Corporate Fraud Detection

Dictionary-based method:
  Purda and Skillicorn [55]: MD&A section from both annual and quarterly reports
  Larcker and Zakolyukina [36]: earnings conference calls
  Humpherys et al. [31]: MD&A section of the 10-K report
Statistical method:
  Cecchini et al. [16]; Glancy and Yadav [26]; Moffitt et al. [47]: MD&A section of the 10-K report
  Goel and Gangolly [27]; Goel et al. [28]: the entire text of the 10-K report

Theoretical Foundation for Social Media-Based Corporate Fraud Detection

To capture the salient features from UGC and to understand how users on social media platforms use language to express their opinions about a company’s operations and performance, we refer to systemic functional linguistics (SFL) theory [29, p. 15]. SFL argues that language is a system of choices/options that writers use to achieve certain goals. The meaning of a text is dependent on those choices within a language system [65]. The term “systemic” views language as “a network of systems or interrelated sets of options for making meaning.” The term “functional” indicates that the approach is concerned with contextualized and practical uses.

SFL theory includes three interrelated functions: ideational, interpersonal, and textual. The ideational function states that language is about construing ideas [29]. The interpersonal function refers to language as a medium for interaction, and it is the means for creating and maintaining our interpersonal relations. These two functions are interlinked via the textual function, which determines how information is organized and presented to create a coherent flow of discourse. In other words, the ideational and textual functions focus on the content of messages, while the interpersonal function deals with interaction structures.

Ideational Function

The ideational function can be represented by topics, opinions, and emotions [2]. Textual documents usually exhibit multiple topics [8]. Brown et al. [11] found that these thematic topics are informative in predicting intentional financial statement misreporting. We believe that topics discussed in the social media data of a fraudulent firm can differ from those of a legitimate firm and are thus useful in classifying firms. We adopt latent Dirichlet allocation (LDA) [8], a widely used topic model, to extract thematic topics from social media data. Opinions are sentiment polarities (e.g., positive, neutral, and negative) about a particular entity [52]. According to Buller and Burgoon [13], deceivers (e.g., fraudulent firms) tend to engage in more strategic activity (information, behavior, and image management) designed to project a positive image. We adopt the sentiment words dictionary in the financial domain created by Loughran and McDonald [41] to measure sentimental opinions expressed by users on financial social media platforms. Emotions consist of various affects such as happiness, sadness, horror, and anger [2]. Newman et al. [48] found that linguistic styles (such as hate, sadness, anger) can predict deception. In this study, we use the emotional categories defined in the Linguistic Inquiry and Word Count dictionary [53] to measure “assent,” “anxiety,” “anger,” “swear,” and “sadness” emotions [36]. Additionally, according to cognitive theory, cognitive appraisal is a component of emotion [37]. Cognitive appraisal in the context of corporate fraud refers to how an individual views a firm’s operating condition. We measure cognitive appraisal by (1) overall description of the fraudulent situation; (2) detailed analysis of the fraudulent behavior; and (3) legal judgments and sanctions.
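As a concrete illustration of the topic-feature step, the sketch below fits a tiny collapsed Gibbs sampler for LDA and returns per-document topic proportions, which would become the topic-feature columns. This is a minimal stand-in, not the paper's implementation; the corpus, number of topics, and hyperparameters are illustrative.

```python
import random
from collections import Counter

def lda_topic_features(docs, k=2, iters=200, alpha=0.1, beta=0.01, seed=7):
    """Return a topic-proportion vector per document via collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vsize = len({w for d in docs for w in d})  # vocabulary size
    # Count tables: doc-topic, topic-word, topic totals.
    ndk = [[0] * k for _ in docs]
    nkw = [Counter() for _ in range(k)]
    nk = [0] * k
    z = []  # topic assignment for every token
    for di, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]
                # Remove the current assignment, then resample the topic
                # proportional to the collapsed posterior.
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + vsize * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Smoothed doc-topic proportions: the topic features for each document.
    return [[(c + alpha) / (len(doc) + k * alpha) for c in row]
            for row, doc in zip(ndk, docs)]
```

In the framework, each firm's pooled social media text would be one document in `docs`, and the returned proportions feed the classifier alongside the other feature groups.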

Three separate word lists are developed to capture each of the above three components. The synonyms-of-fraud word list contains 120 words that are widely used in the Accounting and Auditing Enforcement Releases (AAERs), such as “phony,” “fake,” “sham,” and “deceptive.” The fraudulent behavior word list contains 136 words such as “mislead,” “conceal,” “fabricate,” and “detect.” The legal judgment word list contains 32 words such as “jurisdiction,” “crime,” “forfeiture,” and “sanction.” Table 3 summarizes the opinion and emotional features as well as their measurements.
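The word-list features above reduce to simple ratio computations over a document's tokens. A sketch, with each list abbreviated to the four example words quoted in the text (the full lists contain 120, 136, and 32 words, respectively):

```python
def dictionary_ratios(tokens, word_lists):
    """For each named word list, compute the share of document tokens it covers.

    `tokens` is assumed to be lowercased with stop words already removed,
    matching the note under Table 3.
    """
    total = len(tokens)
    counts = {name: 0 for name in word_lists}
    for tok in tokens:
        for name, words in word_lists.items():
            if tok in words:
                counts[name] += 1
    return {name: (c / total if total else 0.0) for name, c in counts.items()}

# Abbreviated illustrative lists; the paper's lists are much longer.
WORD_LISTS = {
    "fraud_synonyms": {"phony", "fake", "sham", "deceptive"},
    "fraud_analysis": {"mislead", "conceal", "fabricate", "detect"},
    "legal_judgment": {"jurisdiction", "crime", "forfeiture", "sanction"},
}
```

The same function covers the sentiment and emotion ratios of Table 3 once the Loughran-McDonald and LIWC categories are loaded as additional word lists.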

Table 3. Measures of Opinion- and Emotion-Related Features

Opinions:
  Ratio of positive sentiment: total number of positive words divided by total number of words*
  Ratio of negative sentiment: total number of negative words divided by total number of words
Emotions:
  Ratio of assent words: total number of assent words divided by total number of words
  Ratio of anxiety words: total number of anxiety words divided by total number of words
  Ratio of anger words: total number of anger words divided by total number of words
  Ratio of swear words: total number of swear words divided by total number of words
  Ratio of sadness words: total number of sadness words divided by total number of words
  Ratio of fraud synonym words: total number of synonyms of fraud divided by total number of words
  Ratio of fraud analysis words: total number of fraud analysis words divided by total number of words
  Ratio of legal judgment words: total number of legal judgment words divided by total number of words
*Total number of words is the number of words ignoring stop words.

Textual Function

The major element of the textual function is thematic structure. It shows the progression of what is going on and carries the writer’s ideology, which tells people what the writer is really concerned about [29]. The textual function can be conceptualized into three information types: writing styles, genres, and vernaculars [2, 5]. Writing styles and vernaculars are not applicable in our context, since users on social media platforms do not follow a unified writing style or user vernaculars.

Genres in a document represent how writers typically use language to respond to recurring situations [32]. Merkl-Davies and Brennan [45] found that corporate narratives can be regarded as an identifiable genre for business communication with distinctive linguistic properties. Genres can be distinguished by genre analysis using word frequencies based on corpus linguistics [60]. We adopt a modified word frequency, term frequency-inverse document frequency (TF-IDF) [58], for genre classification. TF-IDF assumes that only terms with a high term frequency in a given document but a low document frequency in the whole collection of documents are important in classifying documents. The TF-IDF scheme represents a collection of documents by document-term vectors, in which each element is the weight of a term in a document.
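A minimal sketch of the TF-IDF weighting just described, using plain log-scaled inverse document frequency; the paper cites [58] without spelling out the exact variant, so this particular formula is an assumption:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: TF-IDF weight} dictionary."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency normalized by document length, scaled by log IDF.
        vectors.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors
```

Note that a term appearing in every document receives weight zero (log 1 = 0), which matches the stated intuition that only terms rare across the collection help classify documents; the resulting sparse vectors are what the principal component analysis step later compresses.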

Interpersonal Function

The interpersonal function refers to the fact that “language is a medium of exchange between people” [2; 61, p. 75]. It is generally represented by social interaction/structure that can be built through the reply-to relationships between messages [25]. Abbasi and Chen [2] found that employees’ social network structure presented in inner-firm e-mails changed after the Enron fraud. Pak and Zhou [51] provided new evidence that deception is a strategic activity in which the deceiver juggles the dual goals of promoting deceptive ideas and avoiding detection. They found that social structural characteristics can be used to delineate deception in computer-mediated communication. Numbers of messages, posts, and/or comments have been used to capture social structure in UGC [3, 4]. In this study, we track the number of Analysis reports (AR), the number of Breaking news (BN), and the number of StockTalk messages (SM), as well as the number of comments on those contents for each firm. In addition, we consider the number of distinctive authors who post AR and SM and the number of posts per author. Finally, we track the number of users who follow the news related to a firm. Table 4 summarizes the measures of the features related to the interpersonal function.
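These count-based features can be derived directly from post metadata. A sketch of the eleven counts, in which the record field names (`type`, `author`, `n_comments`) are illustrative rather than taken from the paper:

```python
def interpersonal_features(posts, followers):
    """Derive the count-based interpersonal features from post metadata.

    `posts` is a list of dicts with keys 'type' ('AR', 'BN', or 'SM'),
    'author', and 'n_comments'; `followers` is the firm's follower count.
    """
    feats = {}
    # Per content type: number of items and total comments on them.
    for t in ("AR", "BN", "SM"):
        items = [p for p in posts if p["type"] == t]
        feats[f"n_{t}"] = len(items)
        feats[f"n_comments_{t}"] = sum(p["n_comments"] for p in items)
    # For AR and SM: distinct authors and posts per author.
    for t in ("AR", "SM"):
        authors = {p["author"] for p in posts if p["type"] == t}
        feats[f"n_authors_{t}"] = len(authors)
        feats[f"posts_per_author_{t}"] = feats[f"n_{t}"] / len(authors) if authors else 0.0
    feats["n_followers"] = followers
    return feats
```

The eleven returned values correspond one-to-one to the feature counts listed in Table 4 (3 + 3 + 2 + 2 + 1).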

Based on the above discussions, in Figure 1 we propose a text analytic framework to predict corporate fraud using financial social media data. Since the dimensionality of the TF-IDF term weights vector is often huge, we employ a dimension reduction technique (principal component analysis) for the TF-IDF feature selection. The reduced dimension of TF-IDF features lowers the training time of classifiers and avoids potential overfitting problems. We use a support vector machine (SVM) for document classification. SVM has been shown to be successful in working with large feature spaces and small sample sets [16], and is capable of handling large sparse data [33]. For comparison purposes, we also implement logistic regression (LR), neural networks (NN), and decision tree (DT) in this study. We use accuracy, recall, F1 score, and the area under the receiver operating characteristic (ROC) curve (AUC), which are standard information retrieval metrics [43], to mathematically evaluate the quality of trained classifiers. A tenfold cross-validation technique is employed to assess how the model results generalize to independent test data.
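The evaluation step can be sketched in plain Python: a tenfold index split plus the accuracy, recall, and F1 metrics named above (AUC is omitted for brevity, and the authors presumably used standard toolkit implementations rather than this hand-rolled version):

```python
import random

def kfold_indices(n, k=10, seed=42):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def accuracy_recall_f1(y_true, y_pred, positive=1):
    """Standard classification metrics, treating `positive` as the fraud label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, recall, f1
```

In tenfold cross-validation, each fold serves once as the held-out test set while the classifier trains on the other nine, and the metrics are averaged across folds.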

Table 4. Measures of Interpersonal Function-Related Features

Social interaction structure (feature and measurement: number of features):
  Number of Analysis reports (AR), Breaking news (BN), or StockTalk messages (SM): 3
  Number of comments to AR, BN, or SM: 3
  Number of distinctive authors for AR or SM: 2
  Number of AR or SM per author: 2
  Number of followers: 1

Figure 1. An SFL-Based Framework for Corporate Fraud Detection

Data Collection

Our data came from a few sources. We selected SeekingAlpha (http://seekingalpha.com) as the source to collect financial social media data. SeekingAlpha is a crowd-sourced content service platform for investment research, with broad coverage of stocks, asset classes, exchange-traded funds (ETFs), and investment strategy. Since its inception in early 2004, SeekingAlpha has grown to be the top destination for stock market opinion and analysis on the Internet, with 4 million registered users.

Additionally, in contrast to other equity research platforms, insights on SeekingAlpha are provided by investors and industry experts from the buy-side rather than the sell-side [17]. Off-topic discussions are moderated by the 24-hour in-house moderation team at SeekingAlpha: posts on the bulletin boards are categorized by a ticker symbol, and only topic-related posts can be published due to the site’s “optional post” feature. For each firm, there are five types of information content: Analysis reports, Breaking news, Earning call transcripts, StockTalk, and Videos.

Analysis reports are created by analysts and platform contributors; Breaking news items are created by SeekingAlpha editors and thus can be considered trustworthy [62]; Earning call transcripts come from the firm’s conference call each quarter; StockTalk is organized as discussion forums; Videos are short video clips discussing the focal firm.

We crawled all the contents (including all comments) under the Analysis reports, Breaking news, and StockTalk sections, together with the social network structures for each firm in our sample. Since a fraudulent firm may be disclosed as committing financial fraud in several announcements of the SEC at different times, we considered only the time of the first announcement and extracted only social media data prior to that point. Note that not all fraudulent firms have data on SeekingAlpha prior to their fraud disclosure time. As shown in Figure 2(a), SeekingAlpha had not been established at the time when some firms’ fraudulent behavior was disclosed. Hence, we did not include firms whose fraudulent behavior was disclosed before the establishment of SeekingAlpha.

Figure 2. Timeline of Fraud Period and Establishment of SeekingAlpha

In addition to the social media data, we also collected the financial ratios and the textual content of the MD&A section from the annual financial statements of each firm. These data are used to compare the performance of our algorithm against baseline methods. Based on the literature, financial ratios in the first year of the fraudulent time period (called the first fraud year) for a fraudulent firm are always selected to represent the firm’s operating performance [7, 20]. The financial ratios of the first fraud year for a firm are selected from Compustat, a database of global public companies. The financial statements of the first fraud year are obtained from the SEC’s official company filing system, the EDGAR database.

Sample Selection

We used publicly traded companies in the U.S. stock market to test the performance of the proposed approach. First, all fraudulent public firms were identified and labeled. We used AAERs to screen companies that are involved in financial frauds. The SEC has been issuing AAERs since 1982 to investigate a company or other related parties for alleged accounting misconduct. These releases provide varying degrees of detail about the nature of the accounting and/or auditing misconduct in financial statements. Dechow et al. [19] developed a comprehensive database containing 936 firms (and the misstatement events that affect at least one of the firms’ quarterly or annual financial statements from May 17, 1982, to October 19, 2013) after a thorough analysis of 3,490 AAERs.

We discarded the firms with only quarterly misstatement events because quarterly statements are unaudited [10]. This resulted in 804 companies with annual fraudulent events. Among these, 38 of the firms were accused of wrongdoing that is unrelated to financial misstatements, such as auditor issues, bribes, or disclosure-related issues, among others [19]. We removed those 38 firms from our sample. For each of the remaining 766 companies, we tried to find the SIC (Standard Industrial Classification) code from the EDGAR database and the stock symbol from the Compustat database. The SIC code is used to check whether a company is a financial firm, such as banks, insurance companies, and CPA firms, for which the SIC ranges from 6000 to 6999. The stock symbol is the unique identifier we used to extract social media data for the company from SeekingAlpha. Companies that cannot be found in these two databases and SeekingAlpha were dropped. Following previous research, we also excluded financial firms whose SICs start with the number 6 [6]. On the one hand, the SEC’s industry guidelines require specific disclosures for financial companies such as real estate partnerships, property and casualty insurance, and bank holding companies [55]. On the other hand, accounting rules, asset valuations, and other characteristics for financial companies are different from those for other types of companies [23]. Finally, we discarded 22 companies whose financial data were missing during the fraud period in the Compustat database, and 29 firms whose fraudulent behavior was disclosed before the establishment of SeekingAlpha. We also dropped another 56 fraudulent firms that lacked sufficient social media data (less than 500 words excluding stop words) during the period from the establishment of SeekingAlpha to the first disclosure time. Table 5 details our sample selection process.
We matched each fraudulent firm with a control firm (nonfraudulent) for classification purposes [64]. This is an oversampling strategy, which is appropriate for handling rare events [3]. Random sampling in this case would result in an extremely high percentage of nonfraudulent firms in the sample, thus making attempts to investigate significant features for predicting corporate fraud not meaningful.

Nonfraudulent firms are selected on two criteria. First, we tried to find a direct match using the Compustat database on the basis of the fraud year, firm size, and industry; see Dong et al. [20] for a detailed description of the matching process. Second, each nonfraudulent firm must have enough textual data (at least 500 words excluding stop words) on SeekingAlpha. If multiple firms meet the selection criteria, one of them is chosen at random. This sampling strategy yields a 1:1 ratio of fraudulent to nonfraudulent firms. As a robustness check, we also examined the case in which the sample is not balanced, that is, the ratio of fraudulent to nonfraudulent firms is not 1:1. Note that social media data for nonfraudulent firms are collected up to August 31, 2015, when this study started.
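A minimal sketch of the control-firm matching just described, under stated assumptions: the field names (`fraud_year`, `sic2`, `assets`, `word_count`) and the size tolerance are illustrative, and the exact matching rules are those of Dong et al. [20], not this sketch.

```python
import random

def match_control(fraud_firm, candidates, size_tol=0.3, min_words=500, rng=random):
    """Pick one nonfraudulent control matched on fraud year, industry, and
    firm size, requiring sufficient SeekingAlpha text; ties broken at random.
    Field names and the size tolerance are illustrative assumptions."""
    pool = [c for c in candidates
            if c["year"] == fraud_firm["fraud_year"]
            and c["sic2"] == fraud_firm["sic2"]
            and abs(c["assets"] - fraud_firm["assets"]) <= size_tol * fraud_firm["assets"]
            and c["word_count"] >= min_words]
    return rng.choice(pool) if pool else None
```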

In summary, our final data set includes 64 fraudulent firms together with 64 matched nonfraudulent firms.

Data Preprocessing

For textual social media and MD&A content, we used the Stanford CoreNLP toolkit [44] for sentence segmentation and word tokenization. Punctuation marks, hyperlinks, numerical digits, and special symbols are removed after tokenization.
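The paper uses Stanford CoreNLP for this step. As a rough, dependency-free stand-in (not the authors' pipeline), the same cleanup of dropping punctuation, hyperlinks, digits, and special symbols after tokenization might be sketched with regular expressions:

```python
import re

def clean_tokens(text):
    """Lightweight regex stand-in for the CoreNLP tokenization step: strip
    hyperlinks, tokenize, then keep only alphabetic word tokens (dropping
    punctuation marks, numerical digits, and special symbols)."""
    text = re.sub(r"https?://\S+", " ", text)    # remove hyperlinks first
    tokens = re.findall(r"[A-Za-z']+|\S", text)  # words, or lone non-space symbols
    return [t.lower() for t in tokens if re.fullmatch(r"[A-Za-z']+", t)]
```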

Table 5. Sample Selection Process

Distinct companies                                                          Number
Companies with accounting misconduct in the Dechow et al. [19] data set        936
Less: companies with only quarterly fraudulent events                          132
Subtotal (companies with annual fraudulent events)                             804
Less: companies with auditor, bribe, disclosure, no-date, and other issues      38
Subtotal (companies with annual corporate fraud)                               766
Less: companies that cannot be found in the SEC EDGAR database                 111
Less: companies that cannot be found in the Compustat database                  38
Less: companies that cannot be found in SeekingAlpha                           343
Less: financial companies (banks and insurance, SIC 6000-6999)                 103
Less: companies whose fraud-year financial data cannot be found in Compustat    22
Less: companies disclosed before the establishment of SeekingAlpha              29
Less: companies that do not have enough social media data                       56
Total                                                                           64

Furthermore, we removed the stop words developed for the financial industry by Loughran and McDonald [41]. Words that appear only once in the entire corpus were also dropped [66], because rare words are usually uninformative and most likely just noise [68]. The last important step in data preprocessing is stemming, which reduces a word to its base form and helps discern the importance of specific words within the text. We adopted the WordNet stemmer [46] in this study. The WordNet stemmer preserves the morphology of word forms such as nouns, adjectives, adverbs, and verbs. It also better retains the stem of a word; for example, the word "sharper" (rather than "sharp"), which can denote a swindler, is retained by the WordNet stemmer.
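The stop-word and rare-word filtering can be sketched as follows. The tiny `stop_words` set here merely stands in for the Loughran-McDonald financial list, and the WordNet stemming step is omitted from this sketch:

```python
from collections import Counter

def filter_vocabulary(docs, stop_words):
    """Remove stop words, then drop words that appear only once in the whole
    corpus (rare words are treated as noise). `stop_words` stands in for the
    Loughran-McDonald financial list; WordNet stemming is omitted here."""
    docs = [[w for w in doc if w not in stop_words] for doc in docs]
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if counts[w] > 1] for doc in docs]
```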

For financial ratios, we adopted a rich set of 84 yearly financial ratios (see Table 2 in the online Appendix). This feature set includes 12 annual financial ratios, 24 industry-collaboration contextual features, 24 industry-competition contextual features, and 24 organization contextual features. Among the 12 annual financial ratios (R1 to R12), seven (AQI, DSIR, DEPI, GMI, LEV, SG, and SGEE) come from Beneish [7], who discussed in detail why these ratios are related to financial fraud. The equation for CFED is derived from Dechow [18]. The other financial ratios are obtained from Abbasi et al. [1]. The industry and organization contextual features are generated from year-to-year changes of accounting items retrieved from the Compustat database. When missing data or a zero denominator arose during the computation, we applied the techniques introduced in Beneish [7]. Table 6 describes the summary statistics of our data set; the numbers in parentheses are average values per firm.
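As a toy illustration of deriving contextual features from year-to-year changes of accounting items, with a guard for the missing-data and zero-denominator cases (the paper's actual treatment follows Beneish [7]; this simplified helper is an assumption, not the authors' implementation):

```python
def yoy_change(curr, prev):
    """Year-to-year relative change of an accounting item. Missing values and
    zero denominators are flagged as None for later handling rather than
    computed (a simplified stand-in for the treatment in Beneish [7])."""
    if curr is None or prev is None or prev == 0:
        return None
    return (curr - prev) / abs(prev)
```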

Analysis and Evaluation

We perform a comprehensive analysis to systematically evaluate the efficacy of our proposed algorithm. As discussed earlier, four classifiers (SVM, NN, DT, and LR) are used to predict corporate fraud. We iteratively include each of the three categories of input features (financial ratios, MD&A, social media data) to evaluate their effects on fraud detection. To further evaluate robustness, we test the predictive power of our model on a separate holdout sample as well as on data from another financial social media platform, Yahoo Finance.

Table 6. Description of the Data Set

Data set (128 firms)   No. of analysis reports   No. of breaking news   No. of StockTalk messages      No. of sentences          No. of words   No. of financial ratios
Social media data          3,981 (31.10)            2,251 (17.59)            1,672 (13.06)          184,356 (1,440.28)   2,613,362 (20,416.89)          —
MD&A data                       —                        —                        —                  92,712 (724.31)       902,940 (7,054.22)           —
Financial ratios                —                        —                        —                        —                      —                     84

Fraud Detection Using Only Social Media Data

The input variables here are the features (discussed in the Theoretical Foundation section) extracted from social media data. There are 2 sentiment features, 8 emotion features, and 11 social network features. Principal component analysis yields 127 lexical features. Following the topic model of Blei et al. [8], we compute perplexity scores for numbers of topics ranging from 20 to 1,000. In the end, the 100-topic model (with a minimum perplexity of 1,239.77) is selected. Statistical descriptions of the sentiment, emotion, and social network features are shown in Table 1 of the online Appendix. Table 7 reports the average classification performance of the four classifiers. The SVM model achieves the best testing performance among all classifiers, whereas the LR model performs worst. Using prior-disclosure information from SeekingAlpha, we can predict fraud with 75.50 percent accuracy in the test data set. Since all the social media data predate the first fraud disclosure, a fraud early-warning system could be developed by closely monitoring and extracting useful features from the social media platform, thus reducing the time lag between financial misstatement and fraud disclosure.
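The perplexity-based topic-count selection can be sketched with an off-the-shelf LDA implementation. The candidate counts and corpus below are illustrative toys; the paper scans 20 to 1,000 topics and keeps 100 at the perplexity minimum.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def pick_topic_count(docs, candidates=(2, 3, 4)):
    """Fit LDA for each candidate topic count and return the count with the
    lowest perplexity, plus all scores. Toy-scale stand-in for the paper's
    scan over 20-1,000 topics."""
    X = CountVectorizer().fit_transform(docs)
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        scores[k] = lda.perplexity(X)  # lower is better
    best = min(scores, key=scores.get)
    return best, scores
```

In practice perplexity would be computed on held-out documents rather than the training corpus.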

Using the best-performing SVM model, we investigate each set of social media features independently. Average performances of the model with only social network features, topic features, sentiment and emotion features, and lexical features are shown in Figure 3. We find that topic features are the most predictive of fraud. Comparing social network and lexical features, we note that lexical features are more helpful in the training data while social network features do better in the testing data.

The classification abilities of sentiment and emotion features are weaker than those of the other feature sets in both the training and testing data.

Table 7. Performance of Classification Using Only Social Media Features

                Average accuracy   Average recall   Average F1 score   Average AUC
SVM  Training        99.66             99.50             99.66            99.94
     Testing         75.50             81.56             76.50            86.32
NN   Training       100.00            100.00            100.00            98.26
     Testing         63.17             68.05             62.18            53.71
DT   Training        98.52             98.30             98.52            96.44
     Testing         63.10             66.54             64.93            43.34
LR   Training        50.27             87.04             59.96            46.42
     Testing         54.50             87.75             60.98            43.70

[Figure 3. Performance of Classification Using Each Set of Social Media Features]

[Figure 4. Performance of SVM on Imbalanced Data]

The results of our model using the SVM classifier on imbalanced data with different ratios are presented in Figure 4. As the ratio of fraudulent to nonfraudulent firms decreases, the recalls and AUCs decrease slightly while the accuracy of the model increases. F1 scores, however, drop significantly as the data set becomes more imbalanced. The reason is that adding nonfraudulent firms to the sample trains the classifier to label firms as nonfraudulent as often as possible in order to increase classification accuracy. In other words, the classifier's ability to detect fraud deteriorates, resulting in low recall and precision. Hence, the F1 score, the harmonic mean of recall and precision, drops quickly. The overall performance of the model on the imbalanced data set nevertheless remains good.
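The mechanism behind the F1 drop can be illustrated with a toy calculation: on an imbalanced sample, a degenerate classifier that labels every firm nonfraudulent scores high accuracy yet zero recall, precision, and F1 (the labels below are synthetic, not the paper's data).

```python
def fraud_metrics(y_true, y_pred):
    """Accuracy, recall, precision, and F1 with fraud coded as the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, rec, prec, f1

# 1:9 fraudulent-to-nonfraudulent sample; predicting "nonfraudulent" for every
# firm yields 90% accuracy but zero recall, precision, and F1.
acc, rec, prec, f1 = fraud_metrics([1] * 10 + [0] * 90, [0] * 100)
```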

Comparison with Baseline Methods

In this section, we compare the performance of our model against methods using only structured financial ratios or only language-based features from textual MD&A content.

Baseline Method Using Only Financial Ratios

Table 8 documents the average performance of the four classifiers using only financial ratio data. Again, the SVM model achieves the best performance. Its average testing accuracy, however, is only 56.17 percent, much lower than the performance of the model using social media data. Note that this baseline performance is lower than that reported by Abbasi et al. [1] using the same financial ratios, which may be due to differences in the treatment of missing values in the data sample.

The performance of the baseline method using financial ratios supports the findings of Kaminski et al. [34] and Dechow et al. [19] that a firm's financial numbers do not change dramatically between fraudulent periods and the surrounding truthful years. Thus, financial ratios from the first fraud year are not good indicators for discerning fraudulent from nonfraudulent cases. In contrast, Purda and Skillicorn [55] found that changes in word choice between truthful years and fraudulent periods may raise red flags. This explains why the classification performance using features from social media data in the previous section is much better than the results using only financial ratios.

Table 8. Performance of Baseline Method Using Only Financial Ratios

                Average accuracy   Average recall   Average F1 score   Average AUC
SVM  Training        99.39             98.77             99.37            99.96
     Testing         56.17             77.74             63.37            49.29
NN   Training        76.58             69.60             74.72            73.09
     Testing         48.83             42.39             43.71            41.75
DT   Training        97.41             96.75             97.37            95.34
     Testing         41.17             42.08             40.09            36.02
LR   Training        54.69             60.31             56.88            51.94
     Testing         54.67             60.00             54.54            43.58

Baseline Method Using Only Language-Based Features in MD&A

We replicate the procedure of Purda and Skillicorn [55] to analyze the MD&A section of firms' financial statements. The 1,100 most common words are used, and the fraction of in-bag observations is set to 75 percent. A random forest of 3,000 trees is built on the in-bag documents and tested on the remaining 25 percent of out-of-bag documents. The 200 words most predictive of fraud are then used as language-based input features for classification. Performances of the four classifiers are presented in Table 9. Here, the LR model achieves the best performance in terms of average accuracy (70.33 percent), recall (71.90 percent), and F1 score (70.10 percent), while SVM has the highest average AUC (69.82 percent). Comparing the numbers in Tables 7, 8, and 9, we see that our proposed algorithm using social media features outperforms the baseline methods using financial ratios and language-based features. We also note that the model using language-based features from the MD&A sections performs better than the method using only financial ratios.

Table 9. Performance of Baseline Method Using Only Language-Based Features from MD&A Contents

                Average accuracy   Average recall   Average F1 score   Average AUC
SVM  Training        97.29             97.01             97.28            99.50
     Testing         66.67             66.02             64.25            69.82
NN   Training       100.00            100.00            100.00            98.26
     Testing         66.33             62.19             64.69            52.09
DT   Training        99.14             98.81             99.10            97.57
     Testing         52.78             50.83             54.98            54.14
LR   Training       100.00            100.00            100.00            98.26
     Testing         70.33             71.90             70.10            58.96

Incremental Effect of Combined Feature Sets

To test the incremental effect of each category of data, we gradually add language-based features and then social media features to the financial ratios. We investigate classification performance using three types of feature sets: (1) financial ratios only; (2) a combination of financial ratios and language-based features; and (3) a full combination of financial ratios, language-based features, and social media features. The performances of the four classifiers using these three feature sets are recorded in Tables 10-13. Considering the performance of the SVM classifier in Table 10, the combination of financial ratios and language-based features clearly outperforms financial ratios alone, and the fully combined feature set in turn outperforms the combination of financial ratios and language-based features. Performance improves as more features are added, and the same can be said of the NN model. This result shows that there is indeed incremental value in these three sources of information for fraud detection.
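The incremental evaluation can be sketched by stacking feature blocks and re-scoring a classifier after each addition. The synthetic features and small cross-validation below are illustrative assumptions; the paper's blocks are financial ratios, then MD&A language features, then social media features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def incremental_scores(feature_blocks, y, cv=3):
    """Cross-validated AUC of an SVM as feature blocks are added one at a
    time, mirroring the ratios -> +language -> +social media progression."""
    scores = []
    for i in range(1, len(feature_blocks) + 1):
        X = np.hstack(feature_blocks[:i])  # all blocks added so far
        scores.append(cross_val_score(SVC(), X, y, scoring="roc_auc", cv=cv).mean())
    return scores
```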

For the DT classifier, the combination of financial ratios and language-based features outperforms financial ratios alone, but the performance using all features is slightly worse than that using the combined financial ratios and language-based features. For the LR classifier, the final performance decreases as more features are added. This demonstrates that more features are not necessarily better when classifying fraud; including all features could lead to overfitted models.

Table 10. Performance of SVM Classifier Using Combined Features

SVM                                  Average accuracy   Average recall   Average F1 score   Average AUC
Financial ratios        Training          99.39             98.77             99.37            99.96
                        Testing           56.17             77.74             63.37            49.29
Financial ratios and
language-based features Training          98.71             98.60             98.72
