Read from URL in Java

Read Text Contents from a URL

The class java.net.URL represents a URL (Uniform Resource Locator), a pointer to a “resource” on the Internet. For example, the following is a URL:
http://www.example.com/index.html

  • “http” stands for HyperText Transfer Protocol
  • “www.example.com” is the host machine name
  • “index.html” is the file we are looking for

The following code creates a URL object:

URL url = new URL("http://www.example.com/");
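The parts listed above can be read back from a URL object through its accessor methods. The following is a minimal sketch; the class name UrlParts and the helper describe() are hypothetical names chosen for illustration, while getProtocol(), getHost() and getPath() are methods of java.net.URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParts
{
    // Return "protocol|host|path" for the given URL string (hypothetical helper)
    public static String describe(String urlString) throws MalformedURLException
    {
        URL url = new URL(urlString);
        return url.getProtocol() + "|" + url.getHost() + "|" + url.getPath();
    }

    public static void main(String[] args) throws MalformedURLException
    {
        // prints "http|www.example.com|/index.html"
        System.out.println(describe("http://www.example.com/index.html"));
    }
}
```

Note that getPath() keeps the leading slash, so the file part is reported as "/index.html".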

The java.net.URL class provides the method openStream(), which opens a connection to the URL and returns an InputStream for reading from that connection. It is shorthand for openConnection().getInputStream().

public final InputStream openStream() throws IOException
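To illustrate that the shorthand and the long form behave the same way, the sketch below reads the same resource through both. It makes two assumptions: a temporary local file (reached through a file: URL) stands in for a remote resource so that the example runs without a network, and InputStream.readAllBytes() is available, which requires Java 9 or later. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OpenStreamDemo
{
    // Read everything via the shorthand url.openStream()
    public static String readWithOpenStream(URL url) throws IOException
    {
        try (InputStream in = url.openStream())
        {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Read everything via the long form url.openConnection().getInputStream()
    public static String readWithOpenConnection(URL url) throws IOException
    {
        try (InputStream in = url.openConnection().getInputStream())
        {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException
    {
        // Assumption: a temporary local file stands in for a remote resource
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello from a URL".getBytes(StandardCharsets.UTF_8));
        URL url = tmp.toUri().toURL();

        System.out.println(readWithOpenStream(url));
        System.out.println(readWithOpenConnection(url));
        Files.delete(tmp);
    }
}
```

Both calls should print the same text. The try-with-resources blocks close each stream even if reading fails.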

Using the input stream we can define a java.util.Scanner object for reading text contents from the URL.

Scanner scan = new Scanner( url.openStream() );

The following code reads the text contents of a URL and prints them line by line, prefixing each line with its line number.

URL url = new URL("http://www.example.com/");
InputStream in = url.openStream();
Scanner scan = new Scanner(in);

int line = 1;
while (scan.hasNextLine())
{
    String str = scan.nextLine();
    System.out.println( (line++) + ": " + str);
}
scan.close();
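In the loop above, the scanner is not closed if an exception is thrown while reading. One possible variation, sketched below, uses try-with-resources so that the scanner (and the stream it wraps) is always closed, and names the character encoding explicitly. The class NumberedLines and the method numberLines() are hypothetical, and UTF-8 is an assumption about the content being read. The method accepts any InputStream, so it can be given url.openStream() or, as here, an in-memory stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NumberedLines
{
    // Return each line of the stream prefixed with its 1-based line number
    public static List<String> numberLines(InputStream in)
    {
        List<String> out = new ArrayList<>();
        // try-with-resources closes the scanner and the underlying stream
        try (Scanner scan = new Scanner(in, "UTF-8"))
        {
            int line = 1;
            while (scan.hasNextLine())
                out.add((line++) + ": " + scan.nextLine());
        }
        return out;
    }

    public static void main(String[] args)
    {
        // An in-memory stream stands in for url.openStream() (assumption)
        InputStream in = new ByteArrayInputStream(
            "first\nsecond\n".getBytes(StandardCharsets.UTF_8));
        for (String s : numberLines(in))
            System.out.println(s);   // prints "1: first" then "2: second"
    }
}
```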

Example: Finding the Title in HTML

We wish to design a program that (1) asks the user for a URL, (2) retrieves the HTML content from that URL, and (3) finds the title in the HTML. In an HTML document, the title is the text between the <title> and </title> tags. The data flow of this program is: URL → HTML content → Title.

In the following ReadURLTitle class, we define a method readURLContent() to retrieve HTML contents as a string, and a method findTitle() to find the title in HTML.

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;


public class ReadURLTitle
{
    // Read from a URL and return the content in a String
    public static String readURLContent(String urlString) 
                                    throws IOException
    {
        URL url = new URL(urlString);
        Scanner scan = new Scanner(url.openStream());

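        // Accumulate lines in a StringBuilder; += on a String copies the text on every iteration
        StringBuilder content = new StringBuilder();
        while (scan.hasNextLine())
            content.append(scan.nextLine()).append('\n');
        scan.close();
        return content.toString();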
    }
	
    // Find title within the HTML content
    public static String findTitle(String str)
    {
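        String tagOpen = "<title>";
        String tagClose = "</title>";

        int begin = str.indexOf(tagOpen);
        int end = str.indexOf(tagClose);
        // Return an empty string when either tag is missing or the tags are out of order
        if (begin < 0 || end < begin + tagOpen.length())
            return "";
        return str.substring(begin + tagOpen.length(), end);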
    }
	
    public static void main(String[] args) throws IOException 
    {
        Scanner scan = new Scanner(System.in);
        System.out.println("Please type in a URL:");
        String urlString = scan.nextLine();
        // Nothing to fetch if the user entered an empty line
        if (urlString.length() == 0)
            return;
			
        String content = readURLContent(urlString);
        String title = findTitle(content);
        System.out.println(title);
    }
}
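The findTitle() method above looks for the exact lowercase strings <title> and </title>, so it misses tags written as <TITLE> or tags that carry attributes. A more tolerant variant can be sketched with a regular expression. This is one possible approach under stated assumptions, not a full HTML parser: the class TitleFinder is hypothetical, and returning an Optional for a missing title is a design choice made here for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleFinder
{
    // Matches <title ...>text</title>, ignoring case; DOTALL lets the text span lines
    private static final Pattern TITLE = Pattern.compile(
        "<title\\b[^>]*>(.*?)</title>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Return the trimmed title text, or an empty Optional if no title is found
    public static Optional<String> findTitle(String html)
    {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args)
    {
        String html = "<html><head><TITLE> Example Domain </TITLE></head></html>";
        // prints "Example Domain"
        System.out.println(findTitle(html).orElse("(no title)"));
    }
}
```

Returning an Optional makes the caller decide what to do when a page has no title, rather than relying on a special string value.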
