Read from URL in Java
Read Text Contents from a URL
The class java.net.URL represents a URL (Uniform Resource Locator), a pointer to a “resource” on the Internet. For example, the following is a URL:
http://www.example.com/index.html
- “http” stands for HyperText Transfer Protocol
- “www.example.com” is the host machine name
- “index.html” is the file we are looking for
The following code creates a URL object:
URL url = new URL("http://www.example.com/");
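The parts of a URL described above can be read back from a URL object through its accessor methods getProtocol(), getHost(), and getPath(). The following is a minimal sketch; the class name UrlParts and the helper describe() are made up for this illustration. Note that getPath() returns the path with its leading slash.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class, used only for this illustration
class UrlParts
{
    // Returns the protocol, host, and path of a URL, separated by '|'
    public static String describe(String urlString)
    {
        try
        {
            URL url = new URL(urlString);
            return url.getProtocol() + "|" + url.getHost() + "|" + url.getPath();
        }
        catch (MalformedURLException e)
        {
            // Thrown when the string has no recognizable protocol, for example
            throw new IllegalArgumentException("Not a valid URL: " + urlString, e);
        }
    }

    public static void main(String[] args)
    {
        // Prints: http|www.example.com|/index.html
        System.out.println(describe("http://www.example.com/index.html"));
    }
}
```

No network connection is opened here: constructing a URL object only parses the string.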
The java.net.URL class provides the method openStream(), which opens a connection to the URL and returns an InputStream for reading from that connection. It is shorthand for openConnection().getInputStream().
public final InputStream openStream() throws IOException
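The equivalence between openStream() and openConnection().getInputStream() can be checked without a network connection by pointing a URL at a local file. The sketch below assumes Java 11 or later (for Files.writeString and InputStream.readAllBytes); the class OpenStreamDemo and the helper tempFileUrl() are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; assumes Java 11 or later
class OpenStreamDemo
{
    // Reads the whole resource using the shorthand openStream()
    public static String viaOpenStream(URL url)
    {
        try (InputStream in = url.openStream())
        {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole resource using the long form
    public static String viaOpenConnection(URL url)
    {
        try (InputStream in = url.openConnection().getInputStream())
        {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes text to a temporary file and returns its file: URL
    public static URL tempFileUrl(String text)
    {
        try
        {
            Path p = Files.createTempFile("openstream-demo", ".txt");
            Files.writeString(p, text);
            p.toFile().deleteOnExit();
            return p.toUri().toURL();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        URL url = tempFileUrl("hello\nworld\n");
        // Both calls read the same content
        System.out.println(viaOpenStream(url).equals(viaOpenConnection(url)));
    }
}
```

The try-with-resources statement closes each stream when the read finishes, whether or not an exception is thrown.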
Using the input stream we can define a java.util.Scanner object for reading text contents from the URL.
Scanner scan = new Scanner( url.openStream() );
The following code reads the text contents from a URL and prints them line by line, prefixing each line with its line number.
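URL url = new URL("http://www.example.com/");
InputStream in = url.openStream();
Scanner scan = new Scanner(in);
int line = 1;
// hasNextLine() pairs with nextLine(); hasNext() looks for the next token
// and would stop early if only blank lines remain
while (scan.hasNextLine())
{
    String str = scan.nextLine();
    System.out.println((line++) + ": " + str);
}
// Closing the Scanner also closes the underlying input stream
scan.close();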
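To experiment with the line-reading loop without a network connection, the same logic can be fed from an in-memory stream, since Scanner accepts any InputStream. The following is a sketch under a few assumptions: the class NumberedLines and the method numberLines() are names invented for this example, and the text is assumed to be encoded in UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, used only for this illustration
class NumberedLines
{
    // Reads all lines from the stream and prefixes each with its line number
    public static List<String> numberLines(InputStream in)
    {
        List<String> result = new ArrayList<>();
        // Assumption: the text is UTF-8; try-with-resources closes the Scanner
        try (Scanner scan = new Scanner(in, "UTF-8"))
        {
            int line = 1;
            while (scan.hasNextLine())
            {
                result.add((line++) + ": " + scan.nextLine());
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        byte[] bytes = "<html>\n<title>Demo</title>\n</html>\n"
            .getBytes(StandardCharsets.UTF_8);
        for (String s : numberLines(new ByteArrayInputStream(bytes)))
        {
            System.out.println(s);
        }
    }
}
```

Passing url.openStream() instead of the ByteArrayInputStream would apply the same method to a real URL.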
Example: Finding the Title in HTML
We wish to design a program that (1) asks the user for a URL, (2) retrieves the HTML contents from the URL, and (3) finds the “title” in the HTML. The title of an HTML document is delimited by the tags <title> and </title>. The data flow of this program is: URL → HTML content → Title.
In the following ReadURLTitle class, we define a method readURLContent() to retrieve HTML contents as a string, and a method findTitle() to find the title in HTML.
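import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

public class ReadURLTitle
{
    // Read from a URL and return the content in a String
    public static String readURLContent(String urlString)
        throws IOException
    {
        URL url = new URL(urlString);
        // try-with-resources closes the Scanner and its stream, even on error
        try (Scanner scan = new Scanner(url.openStream()))
        {
            // StringBuilder avoids creating a new String on every append
            StringBuilder content = new StringBuilder();
            while (scan.hasNextLine())
            {
                // Keep the line breaks so adjacent lines do not run together
                content.append(scan.nextLine()).append('\n');
            }
            return content.toString();
        }
    }

    // Find title within the HTML content; returns null if there is no title
    public static String findTitle(String str)
    {
        String tagOpen = "<title>";
        String tagClose = "</title>";
        int begin = str.indexOf(tagOpen);
        if (begin < 0)
            return null;
        begin += tagOpen.length();
        // Look for the closing tag only after the opening tag
        int end = str.indexOf(tagClose, begin);
        if (end < 0)
            return null;
        return str.substring(begin, end).trim();
    }

    public static void main(String[] args) throws IOException
    {
        Scanner scan = new Scanner(System.in);
        System.out.println("Please type in a URL:");
        String urlString = scan.nextLine();
        // Nothing to do if the user entered an empty line
        if (urlString.length() == 0)
            return;
        String content = readURLContent(urlString);
        String title = findTitle(content);
        if (title == null)
            System.out.println("No title found.");
        else
            System.out.println(title);
    }
}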
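The findTitle() method above matches only the exact lowercase strings <title> and </title>. Real pages sometimes write the tag in uppercase or give it attributes. One possible way to tolerate this is a case-insensitive regular expression, sketched below. The class TitleFinder is a name invented for this example, and a regular expression is only an approximation here: it is not a substitute for a real HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for this illustration
class TitleFinder
{
    // Matches <title ...>text</title> in any letter case; DOTALL lets the
    // title text span several lines
    private static final Pattern TITLE = Pattern.compile(
        "<title\\b[^>]*>(.*?)</title\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty Optional if none is found
    public static Optional<String> findTitle(String html)
    {
        Matcher m = TITLE.matcher(html);
        if (m.find())
            return Optional.of(m.group(1).trim());
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        String html = "<HTML><HEAD><TITLE lang=\"en\"> Example Domain </TITLE></HEAD></HTML>";
        // Prints: Example Domain
        System.out.println(findTitle(html).orElse("No title found."));
    }
}
```

Returning an Optional instead of a String makes the "no title" case explicit for the caller.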