Philip Youssef

Engineering @LendUpLoans. Previously @Twitter, @Groupon, @Microsoft.
Pasta enthusiast & restless programmer.

Have you ever thought to yourself: “I have an idea for a great app! If only website X had an API I could use”.

But that API doesn’t exist.

You then think to yourself: “I could build a web scraper to get the data!”.

And you remember the last time you tried to do that, and how much of a pain it was to get set up.

I too was in this position, trying to build an app that needed access to tech headlines from any point in time. I decided to take a swing at building a web service that scrapes techmeme.com and serves up the results as a JSON API. It took a mere 70 lines of code, which I’ll walk through below, along with an exposé of some really powerful libraries I think everyone should know about.

This post is a love letter (it being Valentine’s Day and all) to the following:

  1. Immutables: used to create our models. With a handful of very powerful annotations plus code generation, we’ll be able to create immutable objects and builders to represent our data models.
  2. Jsoup: used for retrieving HTML and parsing it. This is an older library that has stood the test of time. Simply pass in CSS selectors to get the relevant HTML sections.
  3. Pippo: used as our web framework. This is a relatively new web framework for Java that combines a very simple interface with a minimal footprint and a high degree of customizability. It reminds me of Dropwizard with a simpler interface.
  4. Java 8: we’ll be making use of Java 8 streams and optionals to process the incoming data.

The code for everything I’m about to go through can be found here on GitHub.

Step 1: Create the data model

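import com.fasterxml.jackson.databind.annotation.JsonSerialize;
import org.immutables.value.Value;

import java.util.List;
import java.util.Optional;

// Immutables generates ImmutableHeadline (plus a builder) from this interface.
@Value.Immutable
@JsonSerialize(as = ImmutableHeadline.class)
public interface Headline {

    String reporter();
    String source();
    String title();
    String summary();
    String url();

    // Optional: a headline can be built without related headlines.
    Optional<List<ImmutableHeadline>> relatedHeadlines();
}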

We need a way to build, store, and serialize to JSON the headlines we get. The model above, built using the Immutables library, does all of this! The @Value.Immutable annotation generates an ImmutableHeadline class along with an associated builder we can use to construct the object. The @JsonSerialize annotation, which comes from Jackson, tells Pippo’s JSON rendering how to serialize this object. Note that a headline can optionally have related headlines; these aren’t required to build a headline and will not render if left out.
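To make the code generation less magical, here is a rough hand-written sketch of the kind of class an annotation processor like Immutables writes for you. It is a simplified stand-in, not the actual generated ImmutableHeadline: the class name HeadlineSketch is hypothetical, it carries only two of the required fields, and the real generated code has more methods (equals, hashCode, copy helpers and so on).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Hypothetical, simplified stand-in for a generated immutable class.
final class HeadlineSketch {
    private final String title;
    private final String url;
    private final List<HeadlineSketch> related; // null means "not provided"

    private HeadlineSketch(Builder b) {
        // Required attributes fail fast when they were never set.
        this.title = Objects.requireNonNull(b.title, "title");
        this.url = Objects.requireNonNull(b.url, "url");
        // Defensive copy so later changes to the caller's list cannot leak in.
        this.related = b.related == null
                ? null
                : Collections.unmodifiableList(new ArrayList<>(b.related));
    }

    String title() { return title; }
    String url() { return url; }
    Optional<List<HeadlineSketch>> relatedHeadlines() {
        return Optional.ofNullable(related);
    }

    static Builder builder() { return new Builder(); }

    static final class Builder {
        private String title;
        private String url;
        private List<HeadlineSketch> related;

        Builder title(String title) { this.title = title; return this; }
        Builder url(String url) { this.url = url; return this; }
        Builder relatedHeadlines(List<HeadlineSketch> related) {
            this.related = related;
            return this;
        }
        HeadlineSketch build() { return new HeadlineSketch(this); }
    }

    public static void main(String[] args) {
        HeadlineSketch h = HeadlineSketch.builder()
                .title("Some title")
                .url("http://example.com/story")
                .build();
        System.out.println(h.title() + " related present: " + h.relatedHeadlines().isPresent());
    }
}
```

Writing this by hand for every model is exactly the boilerplate the annotation saves us from.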

Step 2: Create the scraper

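import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.List;
import java.util.stream.Collectors;

// Pippo imports (Controller, @Param) are omitted from this excerpt.
public class HeadlineController extends Controller {

    private static final String baseUrl = "http://www.techmeme.com/";

    public void getHeadlines(@Param("date") String headlineDate) {
        try {
            // Fetch the page for the requested date and select each headline block.
            Document doc = Jsoup.connect(generateWebUrl(headlineDate)).get();
            Elements topLinks = doc.select("#topcol1 .clus");

            List<ImmutableHeadline> headlines = topLinks.stream()
                .map(x -> {
                    // The cite line holds the reporter and source; the item holds the story.
                    Element titleBar = x.select(".shrtbl cite").first();
                    Element mainStory = x.select(".itc1 .itc2 .item .ii").first();

                    String reporter = titleBar.ownText().split(" / ")[0];
                    String source = titleBar.select("a").first().ownText();
                    String title = mainStory.select("strong a").first().ownText();
                    String summary = mainStory.ownText();
                    String url = mainStory.select("strong a").attr("href");

                    return ImmutableHeadline.builder()
                        .reporter(reporter)
                        .source(source)
                        .title(title)
                        .summary(summary)
                        .url(url)
                        .build();
                })
                .collect(Collectors.toList());

            getResponse().json(headlines);
        } catch (Exception e) {
            // Any fetch or parse failure (including a missing element) ends up here.
            getResponse().internalError();
        }
    }

    // generateWebUrl(String) is part of the full source and is not shown in this excerpt.
}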

We’ve created a HeadlineController.getHeadlines() method which renders the top headlines for a given date as JSON. Specifically we:

  1. Call Jsoup.connect(url).get() to make an HTTP request that returns an HTML document.
  2. Using the CSS selector "#topcol1 .clus", get a list of all top links (HTML blocks relating to individual headlines).
  3. For each headline, extract the fields we care about using more CSS selectors. Then use ImmutableHeadline.builder() to map the HTML into an object.
  4. Have Pippo render the list of headlines as JSON by calling getResponse().json(headlines);
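One thing worth noting about step 3: a selector that matches nothing gives back null from first(), so a single oddly shaped headline sends the whole request into the catch block. As a hedged sketch of how Java 8 streams and Optional could soften that, the example below pulls the reporter out of raw cite text and simply skips entries that have none. It works on plain strings so it runs without Jsoup; the class and method names (ReporterExtraction, reporterOf, reporters) are made up for illustration and are not part of the original code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class ReporterExtraction {

    // Returns the text before " / ", or empty when the input is null or blank.
    static Optional<String> reporterOf(String citeText) {
        return Optional.ofNullable(citeText)
                .map(text -> {
                    int slash = text.indexOf(" / ");
                    return (slash >= 0 ? text.substring(0, slash) : text).trim();
                })
                .filter(name -> !name.isEmpty());
    }

    // Keeps only the entries that actually yielded a reporter.
    static List<String> reporters(List<String> citeTexts) {
        return citeTexts.stream()
                .map(ReporterExtraction::reporterOf)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Sarah Frier / Bloomberg:", null, "   ");
        System.out.println(reporters(raw)); // prints [Sarah Frier]
    }
}
```

The same shape (wrap in Optional, map, filter, collect) could be applied to each field before handing values to the builder, at the cost of a few more lines.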

Step 3: Create the web server and hook it all up

Creating a web app using Pippo is as simple as:

public class ApiWebDemo {
    
    public static void main(String[] args) {
        Pippo pippo = new Pippo(new ApiWebApp());
        pippo.start();
    }
    
    static class ApiWebApp extends ControllerApplication {
        @Override
        protected void onInit() {
            GET("/headlines", HeadlineController.class, "getHeadlines");
        }
    }
}

That’s it! We define an ApiWebApp class with a single route, /headlines, which calls the getHeadlines method on our HeadlineController above. Our main function starts the server by simply calling pippo.start(), and we’re ready to start taking requests.
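The route hands the date query parameter to the controller as a raw string. If you wanted to reject malformed dates before doing any scraping, a minimal sketch using java.time might look like the following. The DateParam class and its parse method are hypothetical helpers, not part of the original code or of Pippo, and the sketch assumes dates arrive in ISO form such as 2015-01-01.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class DateParam {

    // Parses an ISO date such as "2015-01-01"; empty when missing or malformed.
    static Optional<LocalDate> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDate.parse(raw.trim(), DateTimeFormatter.ISO_LOCAL_DATE));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2015-01-01").isPresent()); // true
        System.out.println(parse("yesterday").isPresent());  // false
    }
}
```

A controller could check the returned Optional first and answer with an error status when it is empty, instead of letting a bad value fall through to the generic catch block.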

Sample output:

Let’s get the tech headlines for New Year’s Day 2015: http://localhost:8081/headlines?date=2015-01-01

And the results are in:

[
  {
    "reporter": "Sarah Frier",
    "source": "Bloomberg",
    "title": "Snapchat raises $485.6M at $10B+ valuation from 23 investors",
    "summary": "  —  Snapchat Raises $485.6 Million to Close Out Big Fundraising Year  —  Snapchat Inc., among a pack of elite technology startups that has attained a valuation of $10 billion or more, capped the year with a filing that disclosed it raised $485.6 million.",
    "url": "http://www.bloomberg.com/news/2015-01-01/snapchat-raises-485-6-million-to-close-out-big-fundraising-year.html"
  },
  {
    "reporter": "William Turton",
    "source": "The Daily Dot",
    "title": "U.K. police allegedly arrest Lizard Squad hacker",
    "summary": "… Lizard Squad took credit for the Dec. 25 distributed denial-of-service (DDoS) attacks against the PlayStation Network and Xbox Live.  DDoS attacks overwhelm a network with too much traffic, leaving targeted networks inaccessible for legitimate users.",
    "url": "http://www.dailydot.com/crime/lizard-squad-vinnie-omari-arrested/"
  }
]