WSJ Article Subject Tags

1. Motivation

This post investigates the distribution of subject tags for Wall Street Journal articles that mention S&P 500 companies. e.g., a December 2009 article entitled, When Even Your Phone Tells You You’re Drunk, It’s Time to Call a Taxi, about a new iPhone app that alerted you when you were too drunk to drive had the meta data to the right. The subject tags are essentially article keywords. I collect every article that references an S&P 500 company over the period from 01/01/2008 to 12/31/2012. It is an appendix to my paper, Feature Selection Risk.

I find that there is substantial heterogeneity in how many different topics people write about when discussing a company even after controlling for the number of total articles. e.g., there were $87$ articles in the WSJ referencing Garmin (GRMN) and $81$ articles referencing Sprint (S); however, while there were only $87$ different subject tags used in the articles about Garmin, there were $716$ different subject tags used in the articles about Sprint! This finding is consistent with the idea that some firms face a much wider array of shocks than others. i.e., the width of the market matters.

2. Data Collection

The data are hand-collected from the ProQuest newspaper archive by an RA. Data collection process for an example company, Agilent Technologies (A), is summarized in the $3$ figures below. First, we searched for each company included in the S&P 500 from 01/01/2008 to 12/31/2012 [list]. Then, after each query, we restricted the results to articles found in the WSJ. Finally, we downloaded the articles and meta data in HTML format.

Initial Search Specification

Restriction to WSJ Articles

Downloaded HTML

After the RA collected all of the data, I used a Python script to parse the resulting HTML files into a form I could manage in R. Roughly $4000$ of the downloaded articles were duplicates resulting from the WSJ publishing the same article in different editions. I identify these observations by checking for articles published on the same day with identical word counts about the same companies. I tried using Selenium to automate the data collection process, but the ProQuest web interface proved too finicky.

3. Summary Statistics

My data set contains $106{\scriptstyle \mathrm{k}}$ articles over $5$ years about $542$ companies. Many articles reference multiple S&P 500 companies. The figure below plots the total number of articles in the database per month. There is a steady downward trend. The first part of the sample was the height of the financial crisis, so this makes sense. As markets have calmed down, journalists have devoted fewer articles to corporate news relative to other things such as politics and sports.

Articles are not evenly distributed across companies as shown by the figure below. While the median company is only referenced in $21$ articles over the sample period, the $5$ most popular companies (United Parcel Service [UPS], Apple [AAPL], Goldman Sachs [GS], Citibank [C], and Ford [F]) are all referenced in at least $1922$ different articles a piece. By comparison, the least popular $1{\scriptstyle \%}$ of companies are mentioned in only $1$ article in $5$ years.

Counting subject tags is a bit less straight-forward than counting articles. I not count tags that are specific to the WSJ rather than the company. e.g., tags containing “(wsj)” flagging daily features like “Abreast of the market (wsj).” I also remove missing subjects. It’s worth pointing out that sometimes the meta data for an article doesn’t contain any subject information. After restrictions, the data contain $10{\scriptstyle \mathrm{k}}$ unique subject tags.

The distribution of subject tag counts per month is similar to that of article counts as shown in the figure below but with a less pronounced downward trend. To create this figure, I count the number of unique subject tags used each month. e.g., if “technology shock” is used $2$ times in Jan 2008, then this counts as $1$ of the $1591$ tags used in this month; whereas, if “technology shock” is then used again on Feb $1$ st 2008, then I count this $3$ rd observation towards the total in February. Thus, the sum of the points in the time series will exceed $10{\scriptstyle \mathrm{k}}$ . Also, note that different articles can have identical subject tags.

As shown in the figure below, the distribution of subject tags used to describe articles about each company is less skewed than the actual article count for each company. There are $179$ different subject tags used in the $21$ articles about the median S&P 500 company during the sample period. The most tagged companies have $10$ times as many subjects as the median firm; whereas, the most written about companies are referenced in $100$ times as many articles as the median firm.

4. Articles per Tag

In order for the distribution of tags per company to be less skewed than the distribution of articles per company, it’s got to be the case that some tags are used in lots of articles. This is exactly what’s going on in the data. The figure below shows that the median subject tag is used in only $3$ articles and the bottom $25{\scriptstyle \%}$ of tags are used in only $1$ article; however, the top $1{\scriptstyle \%}$ of tags are used in $466$ articles or more. e.g., there are roughly $100$ tags out of the $10{\scriptstyle \mathrm{k}}$ unique subject tags in my data set that are used $500$ times are more. Likewise, there are well over $3000$ that are used only once!

This fact strongly supports the intuition that companies–even huge companies like those in the S&P 500—are constantly hit with new and different shocks. Traders have to figure out which aspect of the company matters. This is clearly not an easy problem to solve. Lot’s of ideas are thrown around. Many of them must be eitehr short lived or wrong. Roughly $1$ out of every $4$ topics worth discussing is only worth discussing once.

5. Coverage Depth

I conclude this post by looking at the variation in the number of subject tags across firms with a similar number of articles. e.g., I want to know if there are pairs of firms which journalist spend roughly the same amount of time talking about, but which get covered in very different ways. It turns out there are. The Garmin and Sprint example from the introduction is one such case. The figure below shows that there are many more. i.e., it shows that companies that are referenced in more articles also have more subject tag descriptors, but conditional on the number of articles there is still a lot of variation. The plot is on a $\log_{10} \times \log_{10}$ scale, so a $1$ tick vertical movement means a factor of $10$ difference between the number of tags for $2$ firms with similar article counts. Looking at the figure, it’s clear that this sort of variation is the norm.