HTML Agility Pack–Windows 8

In the previous post, we talked about parsing HTML in Windows Phone 8 projects, and I got a couple of questions about doing the same on Windows 8. On Windows 8, it’s going to be a little bit different.

If you want to read more about why we need the “HTML Agility Pack” library, please read the introduction of the previous post. In this post I will get right to the point: parsing the HTML.

In this tutorial I’m going to extract content from the IMDB website. I picked this website because it doesn’t have any APIs and it has rich content. Keep in mind that the usage policy doesn’t allow us to use it in production, so we will use it for learning purposes only :)

So the web page I’m going to parse is the “movies in theaters” page, which shows the latest movies:

http://www.imdb.com/movies-in-theaters/

Let’s start by creating a sample project. Start Visual Studio 2013 and create a new Windows Store project.

Now we need to create the model for our project, so we can hold the data we get back from the IMDB website. Create a class called Movie and add three properties like this:

    public class Movie  
    {
        public string Title { get; set; }
        public string Cover { get; set; }
        public string Summary { get; set; }
    }

Now we need to create a simple view. In MainPage.xaml, we will add a GridView and edit its ItemTemplate so we can show the movies there. Your XAML code should look like this:

    <Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
        <Grid.RowDefinitions>
            <RowDefinition Height="140" />
            <RowDefinition Height="*" />
        </Grid.RowDefinitions>
        <GridView x:Name="lstMovies" Grid.Row="1">
            <GridView.ItemTemplate>
                <DataTemplate>
                    <Grid Width="300" Height="200" Margin="5">
                        <Grid.ColumnDefinitions>
                            <ColumnDefinition />
                            <ColumnDefinition />
                        </Grid.ColumnDefinitions>
                        <Image Source="{Binding Cover}" />
                        <Grid Grid.Column="1">
                            <Grid.RowDefinitions>
                                <RowDefinition Height="Auto" />
                                <RowDefinition Height="*" />
                            </Grid.RowDefinitions>
                            <TextBlock Text="{Binding Title}" />
                            <TextBlock Grid.Row="1" TextWrapping="Wrap" Text="{Binding Summary}" />
                        </Grid>
                    </Grid>
                </DataTemplate>
            </GridView.ItemTemplate>
        </GridView>
    </Grid>

So far we have simply created a model and designed the view. Now we need to do the real work: download the HTML page, parse it, and create the list of movies.

Let’s start by downloading the HTML page. First we need to install the **HttpClient** library. We could have used WebClient or HttpWebRequest, but HttpClient has the advantage that it’s available as a Portable Class Library, so it will be easier to port the code to other platforms later. Also, all of its methods are async.

Right-click the project, click Manage NuGet Packages, and search for “HttpClient”. Click Install.

[![clip_image004](http://www.tareqateik.com/Media/Default/Windows-Live-Writer/HTML-Agility-PackWindows-Phone-8_13D4D/clip_image004_thumb.jpg "clip_image004")](http://www.tareqateik.com/Media/Default/Windows-Live-Writer/HTML-Agility-PackWindows-Phone-8_13D4D/clip_image004_2.jpg)

Now override the OnNavigatedTo method to download the HTML page. Because the code uses await, the override has to be marked async:

string htmlPage = ""; 

using (var client = new HttpClient())  
{ 
    htmlPage = await client.GetStringAsync("http://www.imdb.com/movies-in-theaters/"); 
} 
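
The snippet above assumes the download always succeeds. As a minimal sketch of how a failed request could be handled, here is a small helper; the names `PageDownloader` and `DownloadHtmlAsync` are my own for illustration and are not part of HttpClient or the original project. It only catches `HttpRequestException` and leaves timeouts and other errors to the caller.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original sample project.
public static class PageDownloader
{
    // Returns the page HTML, or an empty string if the HTTP request fails.
    public static async Task<string> DownloadHtmlAsync(HttpClient client, string url)
    {
        try
        {
            // GetStringAsync throws HttpRequestException when the request
            // cannot be completed or the server returns an error status.
            return await client.GetStringAsync(url);
        }
        catch (HttpRequestException)
        {
            // Assumption: an empty page is an acceptable fallback for this demo.
            return string.Empty;
        }
    }
}
```

Inside `protected override async void OnNavigatedTo(NavigationEventArgs e)`, it could be called as `htmlPage = await PageDownloader.DownloadHtmlAsync(client, "http://www.imdb.com/movies-in-theaters/");`, followed by a check for an empty string before parsing.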

Let’s go back to the NuGet window to install the HTML Agility Pack. Right-click the solution and click Manage NuGet Packages. Search for “Html Agility Pack” and install it.

Now, to understand the structure of our page, we’re going to use the developer tools in Internet Explorer. In Internet Explorer, navigate to the IMDB page, right-click anywhere, and click “Inspect element”.

In the “DOM Explorer”, notice that when you hover over the HTML source code, Internet Explorer highlights the part of the page represented by that code. In our case, we’re trying to get the list of movies.

Looking through the code, we see that we want all the divs whose class starts with “list_item”.

For each movie, we need to extract the title, cover and the summary.

For the image, I kept digging and found that I need a div with the class “image”; inside it there’s an anchor (a tag), then another div, and finally an img tag.

Doing the same for the title and the summary of the movie, I found this:

Title: h4 tag with itemprop="name"
Summary: div with class="outline"

Notice that I didn’t have to walk through every tag all the way down to the content: with HTML Agility Pack I can search the descendants of a node directly and read its “InnerText”, skipping the tags in between.
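
To illustrate that point, here is a minimal sketch of reading the text of an “outline” div without caring about the tags nested inside it. The class `InnerTextDemo`, the method `ReadOutline` and the sample markup are made up for this example; only `HtmlDocument`, `LoadHtml`, `Descendants`, `GetAttributeValue` and `InnerText` come from HTML Agility Pack.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical demo class, not part of the original sample project.
public static class InnerTextDemo
{
    // Returns the trimmed text of the first div with class "outline",
    // or an empty string if the page has no such div.
    public static string ReadOutline(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var outline = document.DocumentNode.Descendants("div")
            .FirstOrDefault(d => d.GetAttributeValue("class", "") == "outline");

        // InnerText concatenates the text of all nested nodes,
        // so the span and b tags in between do not matter.
        return outline == null ? "" : outline.InnerText.Trim();
    }
}
```

For example, `InnerTextDemo.ReadOutline("<div class=\"outline\"><span>A <b>short</b> summary</span></div>")` returns `A short summary`.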

Let’s start implementing that in code. First we need to load the HTML source we downloaded into an HTML document:

HtmlDocument htmlDocument = new HtmlDocument();  
htmlDocument.LoadHtml(htmlPage);  

Then we loop over the divs that hold the movies and build the list:

List<Movie> movies = new List<Movie>();

// Every movie lives in a div whose class starts with "list_item".
foreach (var div in htmlDocument.DocumentNode.Descendants().Where(i => i.Name == "div" && i.GetAttributeValue("class", "").StartsWith("list_item")))
{
    Movie newMovie = new Movie();
    // Cover: the src of the first img inside the div with class "image".
    newMovie.Cover = div.Descendants().Where(i => i.Name == "div" && i.GetAttributeValue("class", "") == "image").FirstOrDefault().Descendants().Where(i => i.Name == "img").FirstOrDefault().GetAttributeValue("src", "");
    // Title: the h4 with itemprop="name". Summary: the div with class "outline".
    newMovie.Title = div.Descendants().Where(i => i.Name == "h4" && i.GetAttributeValue("itemprop", "") == "name").FirstOrDefault().InnerText.Trim();
    newMovie.Summary = div.Descendants().Where(i => i.Name == "div" && i.GetAttributeValue("class", "") == "outline").FirstOrDefault().InnerText.Trim();
    movies.Add(newMovie);
}

Unlike the XPath approach we used on Windows Phone, here we simply use the Name property to check the tag name and the GetAttributeValue method to read any attribute (class, src, href, etc.).
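
One thing to watch in the loop above: it calls FirstOrDefault() and then uses the result right away, which throws if an expected tag is missing. Below is a sketch of a more defensive variant under the same assumptions about the page structure. The class `MovieParser` and its `Parse` method are hypothetical names of mine, and skipping entries without a title is my own choice, not something the original code does.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Same model as the Movie class defined earlier in the post.
public class Movie
{
    public string Title { get; set; }
    public string Cover { get; set; }
    public string Summary { get; set; }
}

// Hypothetical helper, not part of the original sample project.
public static class MovieParser
{
    public static List<Movie> Parse(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        var movies = new List<Movie>();

        foreach (var item in document.DocumentNode.Descendants("div")
            .Where(d => d.GetAttributeValue("class", "").StartsWith("list_item")))
        {
            var image = item.Descendants("div")
                .FirstOrDefault(d => d.GetAttributeValue("class", "") == "image");
            var img = image == null ? null : image.Descendants("img").FirstOrDefault();
            var title = item.Descendants("h4")
                .FirstOrDefault(h => h.GetAttributeValue("itemprop", "") == "name");
            var outline = item.Descendants("div")
                .FirstOrDefault(d => d.GetAttributeValue("class", "") == "outline");

            // Assumption: an entry without a title is not worth showing.
            if (title == null) continue;

            movies.Add(new Movie
            {
                Title = title.InnerText.Trim(),
                // Missing cover or summary falls back to an empty string.
                Cover = img == null ? "" : img.GetAttributeValue("src", ""),
                Summary = outline == null ? "" : outline.InnerText.Trim()
            });
        }
        return movies;
    }
}
```

With this variant, a movie that has no cover or summary still shows up in the list with empty fields instead of crashing the app.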

Finally, we set the ItemsSource of the GridView in the view to the list of movies:

lstMovies.ItemsSource = movies; 

Running the app shows the movies in the grid, each with its cover, title, and summary.

Summary

So we’ve seen how powerful this library is: you can “literally” build any application you want out of a website. I’ve been using this library for a while, and here are some hints that may be useful:
1- Read the usage policy of the website. I know I keep mentioning that, but that tells you it’s important :)
2- Always try to use the mobile website, as it generates smaller HTML pages and therefore faster load times in your app (for example http://m.imdb.com)
3- The library is powerful, but if the structure of the website changes, your app will crash. You can mitigate that by following hint number 4.
4- Instead of using the HTML Agility Pack on the client side, you can create a server that parses the content from the website and provides it to the clients as JSON. If the structure of the website changes, the clients will still have the old data while you fix your server to adapt to the changes. Also note that this approach will make your app run faster, as JSON is more concise than HTML pages.
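
As a rough sketch of hint 4, this is how a server could turn the parsed movies into JSON and how a client could read them back. I’m using DataContractJsonSerializer here only as an assumption for illustration; the post doesn’t prescribe a JSON library, and the `MovieJson` class is a made-up name.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;

// Same model as the Movie class defined earlier in the post.
public class Movie
{
    public string Title { get; set; }
    public string Cover { get; set; }
    public string Summary { get; set; }
}

// Hypothetical helper, not part of the original sample project.
public static class MovieJson
{
    // Server side: turn the parsed movies into a JSON string.
    public static string Serialize(List<Movie> movies)
    {
        var serializer = new DataContractJsonSerializer(typeof(List<Movie>));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, movies);
            byte[] bytes = stream.ToArray();
            return Encoding.UTF8.GetString(bytes, 0, bytes.Length);
        }
    }

    // Client side: read the JSON back into a list of movies.
    public static List<Movie> Deserialize(string json)
    {
        var serializer = new DataContractJsonSerializer(typeof(List<Movie>));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            return (List<Movie>)serializer.ReadObject(stream);
        }
    }
}
```

The client then only needs to download the JSON and call `MovieJson.Deserialize`, so a change in the website’s HTML is fixed once on the server rather than in every installed app.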

You can download the source code from here:

http://sdrv.ms/185Btyn

Tareq Ateik
