The 2020 Democratic candidates for President will face off in debates starting Wednesday. Many of them are current or former members of Congress. All of them are vying to lead the country.
As voters, wouldn’t we find it extraordinary to have a record of everything they have said on the floor of the Senate or the House over the course of their political careers?
And as data scientists, wouldn’t we like to extract their words, analyze them, and use them to make judgments or predictions about these Congresspeople as Presidential candidates?
Yes, we can. The Constitution requires Congress to keep a journal of its proceedings. The Government Publishing Office thus prints (and digitally posts) the Congressional Record, which contains the daily official proceedings of the legislature’s two chambers, including:
Although the data live in the public domain, getting them out of the website and into a usable format poses a bit of a challenge. Without access to pricy legal databases, web scraping is the best option for an enterprising member of the public, and Scrapy makes it relatively painless to get a lot of information quickly.
Scrapy allows for asynchronous web scraping with Python. You can use it to extract data using APIs, integrate it with BeautifulSoup and Selenium, and extend its capabilities in as many ways as a spider web has filaments.
Scrapy’s central conceit, borrowed from Django, is that it is a “don’t repeat yourself” framework, meaning it provides a way to reuse code and easily scale projects up to a larger scope. The component parts of the crawler, such as the items, middlewares, pipelines, and settings, live in separate scripts, and multiple “spiders” that crawl different websites can use them within the same “project”.
The spiders themselves, naturally, rely on object-oriented programming (each is of the class “Spider”):
The vital components are:
One of Scrapy’s most useful features is the Scrapy shell, which allows you to explore the website you are scraping in real time to test your assumptions. Essentially, you get to try your code out in sandbox mode before you deploy it (and find out it doesn’t work).
When you’re working with something as complex as XPath, this addition vastly reduces the time and frustration of drilling down into the structure of a website to extract the content you need. For instance, I needed to fetch partial URLs and text from particular elements within the HTML on the Congress.gov website. The Scrapy shell let me ensure my XPath syntax did not return empty lists before I copied and pasted the resultant syntax into my code.
A note on “copy XPath” from DevTools: While you may find it enticing to simply right-click on an HTML element in DevTools and select “Copy XPath” from the menu that appears, do not succumb to temptation. If, like Congress, your site is organized in tables, you’ll fail to retrieve everything you want. In my case, the PDFs I wanted are located in the far right column of the page:
Clicking “Copy XPath” gives me the following:
This is just describing a position within a table.
Here’s what that returns in the shell:
Here’s what my actual XPath expression is:
And here’s what it returns, for a given page:
You must learn to use XPath to select the meaningful content, rather than rely on automatically generated expressions, which, rather like the governmental body in question, often table function in favor of form.
Once I had written the spider, it wasted remarkably little time in downloading 25 years’ worth of the Congressional Record to my hard drive. A little more code later, I had extracted the text with Tika, a Python library (a wrapper around Apache Tika) that pulls text out of PDFs.
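A rough sketch of that extraction step, assuming the `tika` package’s `parser.from_file`, which returns a dictionary with `"content"` and `"metadata"` keys. The function names, the folder layout, and the optional `parse` argument (there so the logic can be exercised without a running Tika server) are hypothetical:

```python
from pathlib import Path


def extract_text(pdf_path, parse=None):
    """Return the plain text of one PDF, or "" if nothing was extracted."""
    if parse is None:
        # Imported lazily: tika-python needs Java available to start its server.
        from tika import parser
        parse = parser.from_file
    parsed = parse(str(pdf_path))
    # "content" can be None when no text could be extracted.
    return (parsed.get("content") or "").strip()


def extract_folder(folder, parse=None):
    """Map each PDF filename in a folder to its extracted text."""
    return {p.name: extract_text(p, parse) for p in sorted(Path(folder).glob("*.pdf"))}
```

Calling `extract_folder("pdfs")` would then return one string per downloaded file, ready for splitting and analysis.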
I created a regular expression to split out individual speeches:
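A minimal sketch of that kind of pattern, assuming each speech opens at the start of a line with a title and an upper-case surname (for example “Mr. SANDERS.”). The pattern, function name, and sample text are made up for illustration, and real transcripts will need more cases (presiding officers, names qualified by state, and so on):

```python
import re

# Assumed convention: "<title>. <SURNAME>. " at the start of a line opens a speech.
SPEAKER = re.compile(r"^\s*((?:Mr|Mrs|Ms)\. [A-Z][A-Z'\-]+)\. ", re.MULTILINE)


def split_speeches(text):
    """Return (speaker, speech) pairs in the order they appear."""
    matches = list(SPEAKER.finditer(text))
    speeches = []
    for i, m in enumerate(matches):
        # A speech runs from the end of its header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        speeches.append((m.group(1), text[m.end():end].strip()))
    return speeches


sample = (
    "  Mr. SANDERS. Mr. President, I rise today to speak about health care.\n"
    "  Ms. WARREN. Madam President, I yield the floor.\n"
)
for speaker, speech in split_speeches(sample):
    print(speaker, "|", speech)
```

Requiring an all-caps surname at the start of a line keeps forms of address such as “Mr. President” inside a speech from being mistaken for a new speaker.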
And tried out some very preliminary sentiment analysis:
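As a toy stand-in for a first pass of this kind, the sketch below scores a speech against two tiny hand-made word lists. It is not a trained sentiment model and not necessarily the approach used here; the word lists and the `polarity` function are invented for illustration:

```python
import re

# Tiny, hand-made word lists: purely illustrative, not an established lexicon.
POSITIVE = {"support", "proud", "great", "honor", "thank"}
NEGATIVE = {"oppose", "fail", "crisis", "shameful", "wrong"}


def polarity(speech):
    """Score in [-1, 1]: (positive hits - negative hits) / total hits; 0.0 if no hits."""
    words = re.findall(r"[a-z']+", speech.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


print(polarity("I am proud to support this bill."))
print(polarity("The Senate will come to order."))
```

A score like this is crude: it ignores negation, sarcasm, and the procedural language that fills much of the Record, so it is best treated as a sanity check before reaching for a proper model.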
What’s next for Congress, now that their words are laid bare? Analysis and data mining, Natural(Language Processing)ly. While much ado has been made about Twitter, hardly a covfefe has yet been raised about what Senators and Representatives have said on the floor of Congress, whether because of the difficulty of obtaining the data, the trickiness of training the models, the lack of legal literacy, or all of the above. But with the proper tools and domain knowledge, I hope to provide some valuable insights ahead of the 2020 election season.