Networking: Downloading Files From the Internet

People can talk to each other if they speak and understand the same language. Computers connected to a network can communicate with each other if they use the same protocol. For example, your Web browser uses the HTTP protocol to send requests to and receive responses from a remote server. Such communication is possible only if the program running on the server also understands HTTP. Any network protocol defines the rules of how a program should send or receive data.

Say you want to find a book about Harry Potter in an online bookstore. Usually you’d enter Harry Potter in the search field of that bookstore’s Web site and press the button to start the search, but the browser sends a request with a lot more data than the dozen characters you’ve entered. According to the HTTP protocol, every request should include a request header, which defines the method of sending data, identifies the browser that sends it, requests to keep the connection open (or not), and has other attributes.

Similarly, a server should run a program that receives and understands the HTTP request, finds all books that have Harry Potter in the title, and sends a response back to you, adding a response header to the data. Both request and response headers can add several hundred bytes to the actual data being sent.
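To make the idea of a request header more concrete, here is a minimal sketch of the text a browser might send for the Harry Potter search. The host name www.example-bookstore.com, the path /search, and the User-Agent value are made up for illustration, and a real browser sends more header lines than these.

```java
// A sketch of how an HTTP request is laid out as text.
// The host, path, and User-Agent values are hypothetical.
class HttpRequestSketch {

    // Builds the text of a simple HTTP GET request
    static String buildRequest(String host, String pathAndQuery) {
        return "GET " + pathAndQuery + " HTTP/1.1\r\n"  // the method and the resource
             + "Host: " + host + "\r\n"                  // the Web site we're asking
             + "User-Agent: MyBrowser/1.0\r\n"           // identifies the browser
             + "Connection: keep-alive\r\n"              // asks to keep the connection open
             + "\r\n";                                   // an empty line ends the header
    }

    public static void main(String[] args) {
        String searchText = "Harry Potter";
        String request = buildRequest("www.example-bookstore.com",
                                      "/search?q=Harry+Potter");
        System.out.println(request);
        System.out.println("You typed " + searchText.length()
                + " characters, but the request has " + request.length() + ".");
    }
}
```

Even this short request is several times longer than the search text itself, which is the point the paragraphs above make about headers.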

Some other popular networking protocols are TCP/IP, FTP, and WebSocket. In this chapter you’ll learn how to write Java programs that can connect to remote servers and download data using the HTTP protocol.

How Computers Find Each Other on the Web

If you want to visit a person, you need to know his or her exact address - the street, the building and apartment number. In the World Wide Web (WWW), to visit a Web site you need to know its Uniform Resource Locator (a URL). Let’s see which parts a URL consists of by using an example from Nickelodeon’s Web site:

This URL tells me that the host (the server) nick.com has the file called nickelodeon-football-stars.html located in the folder games. The port number plays the same role as an apartment number in a building.

The server nick.com can run many programs, which serve users' requests, but we want to enter this server through "the door" number 80. As a matter of fact, the port number 80 is a default port number for all Web servers that use HTTP. That’s why if you skip the port number, your Web browser will try to connect to the Web server via the port 80.

If a URL starts with https, the s stands for "secure". All requests and responses are automatically encrypted by the browser and the server for additional security. If the user enters a URL that starts with https and has no port number, the Web browser sends an HTTPS request to the server’s port 443.
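If you’d like to see these parts from a Java program, the class java.net.URL (introduced later in this chapter) can split a URL for you. The sketch below is only an illustration: the URL string is assembled from the host, port, folder, and file named in the text, and the class name UrlParts and the method effectivePort are my own.

```java
import java.net.MalformedURLException;
import java.net.URL;

// A sketch that splits a URL into the parts described above
class UrlParts {

    // Returns the port written in the URL or, if there is none,
    // the default port of the protocol
    static int effectivePort(URL url) {
        return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Assumed URL, built from the parts named in the text
        URL url = new URL("http://www.nick.com:80/games/nickelodeon-football-stars.html");

        System.out.println("Protocol: " + url.getProtocol());
        System.out.println("Host:     " + url.getHost());
        System.out.println("Port:     " + url.getPort());
        System.out.println("Path:     " + url.getPath());

        // These URLs have no port number, so the defaults are used
        System.out.println("http port:  "
                + effectivePort(new URL("http://www.nick.com/videos")));
        System.out.println("https port: "
                + effectivePort(new URL("https://www.nick.com/videos")));
    }
}
```

Creating these URL objects doesn’t contact nick.com, so the sketch runs without an Internet connection.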

Now let’s dissect another URL from Nickelodeon: http://www.nick.com/videos

There is no file name here. What does videos stand for in this URL? It’s not a file, and it’s not a folder on the host either. It’s just an ID that represents some resources at nick.com. There is a program running at nick.com that understands what this ID means and redirects the request to the application that serves the Web page with videos that may look like this:

Often the content that corresponds to a resource ID is not stored in a file, but is generated on the fly by a server-side program and can change daily, hourly, every minute or every second. Most likely by the time you enter http://www.nick.com/videos in your browser the content will be different.

Domain Names and IP Addresses

The name of the server (the host name) must be unique, and when a person or an organization registers a Web site, a unique number is assigned to this physical server. This number is called an IP address. The IP address is either a group of four numbers (e.g. 122.65.98.11) or up to eight hexadecimal numbers (e.g. 2001:cdba:0000:0000:0000:0000:3257:9652). Remembering these long numbers would be difficult for people, which is why we use domain names that are easier to remember (e.g. nick.com or google.com).

When you enter a domain name in the browser, special software, usually provided by your Internet service provider, finds the IP address that corresponds to the entered domain name. This lookup service is called the Domain Name System (DNS). If the address is found, your browser connects to that server and receives the requested Web page, which is a document in HTML format. If it’s not found, you’ll get an error message saying that the site can’t be found.
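A Java program can ask for the same lookup with the class java.net.InetAddress. The following is a small sketch rather than a complete tool: the class name HostLookup and the method findIpAddress are my own, and the address printed for nick.com depends on when and where you run it.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// A sketch of finding the IP address for a host name
class HostLookup {

    // Returns the IP address as text, or null if the name can't be resolved
    static String findIpAddress(String hostName) {
        try {
            return InetAddress.getByName(hostName).getHostAddress();
        } catch (UnknownHostException ex) {
            return null;   // no IP address was found for this name
        }
    }

    public static void main(String[] args) {
        // This lookup needs an Internet connection
        String ip = findIpAddress("nick.com");
        if (ip == null) {
            System.out.println("Can't find the IP address of nick.com");
        } else {
            System.out.println("nick.com is located at " + ip);
        }
    }
}
```

If you pass an IP address such as 122.65.98.11 instead of a name, no lookup is needed and the method simply returns the same address.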

You may say, "But entering nick.com shows a lot of content even without providing any document name!" This happens because Web site creators configure their servers to serve a certain HTML file if the user doesn’t specify one. Often they name the file with the home page content index.html, but it can be any other file too.

Most people connected to the Internet are getting dynamic (temporary) IP addresses assigned to their computers, which can change without notice.

Downloading Files From the Web

In Chapter 11 you learned how to read a local file if you know its location. You’d open a stream pointing to this file and read it. The same holds true for reading remote files. The only difference is that you need to open the stream over the network. Java offers a variety of classes for reading (or writing) remote files. All these classes are located in the package java.net. In this chapter we’ll use the classes that can exchange data between your computer and remote hosts using HTTP.

In particular, Java has a class java.net.URL that helps your program to connect to a remote computer on the Internet and get access to a resource there, provided that it’s not protected from public access.

Let’s see how you can start writing a program that creates an instance of the URL object that points at some remote resource, e.g. google.com:

try{
    URL myUrl = new URL("http://google.com");

    // More code will go here
}
catch(MalformedURLException ex){
    // Print the reason why the URL was rejected
    System.out.println(ex.getMessage());
}

The MalformedURLException is thrown if you made a typo in the code, e.g. you’ve typed htp instead of http or some spaces sneaked into the URL. If the MalformedURLException is thrown, it does not indicate that the remote server is down; just check "the spelling" of your URL.
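Here is a short sketch that shows this difference in action. The class name UrlSpellingCheck and the method isWellFormed are my own, and the address no-such-server-here.example is deliberately made up so that no real server stands behind it.

```java
import java.net.MalformedURLException;
import java.net.URL;

// A sketch that checks only "the spelling" of a URL
class UrlSpellingCheck {

    // Returns true if a URL object can be created from the given text
    static boolean isWellFormed(String address) {
        try {
            new URL(address);   // no connection is made here
            return true;
        } catch (MalformedURLException ex) {
            System.out.println("Check the spelling: " + ex.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("http://google.com"));
        System.out.println(isWellFormed("htp://google.com"));
        // A made-up host: the spelling is fine even though no such server exists
        System.out.println(isWellFormed("http://no-such-server-here.example"));
    }
}
```

The second call reports a problem because htp is not a known protocol. The third call passes the check even though the server doesn’t exist, because creating a URL object only examines the text of the address.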

Creating the URL object does not establish a connection with the remote computer. If your program needs to read a remote file, you still need to open a stream on that file. Usually a program performs the following steps to read a file from the Internet via an HTTP connection:

1. Create an instance of the class URL.
2. Create an instance of the URLConnection to open a connection using the URL object from Step 1.
3. Get a reference to the input stream of this object by calling the method getInputStream from URLConnection.
4. Read the data from the stream.

While writing code for reading streams over the network, you’ll have to handle possible I/O exceptions the same way you did while reading local files in Chapter 11. The following program WebSiteReader reads and prints the code of the google.com page.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class WebSiteReader {
 public static void main(String args[]){

   URL url = null;
   URLConnection urlConn = null;

   try
   {
       url  = new URL("http://www.google.com" );   // (1)

       urlConn = url.openConnection();             // (2)

   } catch(MalformedURLException ex){
       System.out.println ("Check the spelling of the URL: " + ex.getMessage());
       return;  // can't continue without a valid URL
   } catch( IOException e){
       System.out.println("Can't connect to the provided URL: " + e.toString() );
       return;  // urlConn is still null, so there is nothing to read
   }

   try( InputStreamReader inStream =              // (3)
        new InputStreamReader (urlConn.getInputStream(), "UTF8");
        BufferedReader buff  = new BufferedReader(inStream);){

       String currentLine;

       // Read and print the code of Google's home page
       while ((currentLine = buff.readLine())!= null ){ //(4)

               System.out.println(currentLine);
       }
   } catch(IOException  ioe){
       System.out.println("Can't read from the Internet: "+
               ioe.toString());
   }
 }
}

(1) The WebSiteReader creates an instance of the class URL.
(2) Then it gets a reference to an instance of URLConnection, which represents the connection to the remote resource.
(3) After that WebSiteReader opens an InputStreamReader, which is chained with a BufferedReader.
(4) The while loop reads a line from the BufferedReader and, if it’s not null, prints the line on the console.

Make sure your computer is connected to the Internet before you run the WebSiteReader program. Actually, I was writing this program while sitting on a plane without an Internet connection. This is what the program printed up in the sky:

Can't read from the Internet: java.net.UnknownHostException: www.google.com

When my computer got the Internet connection the output was different. Here’s a fragment of what you can expect to see on the console after running WebSiteReader:

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="fr"><head><meta content="/logos/doodles/2015/110th-anniversary-of-first-publication-of-becassine-5701649318281216-hp.jpg" itemprop="image"><title>Google</title><script>(function(){window.google={kEI:'5OzPVMyJM4ukygPRz4CoBQ',kEXPI:'4011559,4013606,4020347,4020562,4021598,4022545,4023678,4024599,4024626,4025090,4027899,4027921,4028062,4028128,4028367,4028508,4028634,4028706,4028717,8300111,8500393,8500852,8501081,8501084,10200083,10200903,10200904',authuser:0,kSID:'5OzPVMyJM4ukygPRz4CoBQ'};google.kHL='us';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b

This is a mix of HTML and JavaScript code. The class WebSiteReader explicitly creates the object URLConnection. Strictly speaking, you could achieve the same result by using only the class URL:

URL url = new URL("http://www.google.com");
InputStream in = url.openStream();
BufferedReader buff= new BufferedReader(new InputStreamReader(in));

The reason you may want to use the URLConnection class is that it can give you some additional control over the I/O process. For example, by calling its method setDoOutput with the argument true you enable WebSiteReader to write to the remote URL. In this case you’d need to get a reference to an OutputStream object by calling the getOutputStream method on the URLConnection object. If you wanted to write a program that can send data to the server, you’d need to learn server-side programming, which this book doesn’t cover.
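As one more illustration of that additional control, the sketch below sets two timeouts and one request header on a URLConnection before any data is read. The methods setConnectTimeout, setReadTimeout, and setRequestProperty belong to URLConnection; the class name ConnectionSettings, the five-second limits, and the User-Agent text are arbitrary choices of mine.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// A sketch of configuring a URLConnection before reading from it
class ConnectionSettings {

    // Creates a URLConnection and configures it; nothing is sent yet
    static URLConnection prepare(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setConnectTimeout(5000);  // give up if connecting takes more than 5 seconds
        conn.setReadTimeout(5000);     // give up if the server is silent for 5 seconds
        // A made-up name that identifies our program to the server
        conn.setRequestProperty("User-Agent", "WebSiteReader-sketch");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        URLConnection conn = prepare("http://www.google.com");

        // Asking for the content type is what actually contacts the server
        System.out.println("Content type: " + conn.getContentType());
    }
}
```

Without timeouts a program may wait for a very long time if the remote server doesn’t answer, so setting them is a reasonable habit for anything that reads from the network.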

Downloading Any File From the Internet

In Chapter 11 you learned how to create a file and write into it. The WebSiteReader program just prints the remote data on the console, but you could have saved the data in a local file as well. It’s time to add the code that writes into a file. The goal is to write a program that can download any unprotected file (such as images, music, and binary files) available on the Web.

The following class FileDownload creates the URLConnection object and gets its InputStream. This class also creates an OutputStream to the local file. The URL and the local filename are given to the FileDownload program as command-line arguments (explained in Chapter 11). FileDownload connects to the provided URL, downloads its content, and saves it in a local file.

import java.io.OutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FileDownload{

  public static void main(String args[]){
    if (args.length != 2){                 // (1)
      System.out.println(
            "Proper Usage: java FileDownload FileURL DestinationFileName");
      System.out.println(
            "For example: java FileDownload http://myflex.org/yf/nyc.jpg nyc.jpg");
      System.exit(-1);
    }

    URLConnection fileStream=null;

    try{
        URL remoteFile=new URL(args[0]);        // (2)

        fileStream=remoteFile.openConnection(); // (3)

    } catch (IOException ioe){
        ioe.printStackTrace();
        return;   // fileStream is still null, so there is nothing to download
    }

    Path path = Paths.get(args[1]);                // (4)

    try(OutputStream fOut=Files.newOutputStream(path); // (5)

      InputStream in = fileStream.getInputStream();){ // (6)

      System.out.println("Downloading from " + args[0] + ". Please wait...");

      // Read a remote file and save it in the local one

      int data;

      while((data=in.read())!=-1){      // (7)
          fOut.write(data);             // (8)
      }

      System.out.println("Finished downloading the file. It's located at "+path.toAbsolutePath());
    } catch (Exception e){
        e.printStackTrace();
    }
  }
}

(1) This program starts by checking that two command-line arguments were provided. If not, it prints a message showing the right way to start FileDownload and exits the program by invoking the method exit of the class System.
(2) Then the program creates an instance of the URL object using the URL provided as the first command-line argument.
(3) It opens a connection to the remote file.
(4) It creates a Path object pointing to the local file where the downloaded data will be saved.
(5) It obtains the OutputStream to the local file for writing.
(6) It obtains the InputStream to the remote file for reading.
(7) It reads a byte from the InputStream.
(8) It writes that byte to the OutputStream.
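The while loop in FileDownload moves one byte at a time, which is easy to follow but can be slow for large files. A common alternative is to read a whole chunk of bytes into an array on each pass. The sketch below shows that idea under a few assumptions of mine: the class name BufferedCopy, the method copy, and the 8192-byte chunk size are arbitrary, and in-memory streams stand in for the remote and local files so that it runs without the Internet.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// A sketch of copying a stream in chunks instead of byte by byte
class BufferedCopy {

    // Copies everything from in to out and returns the number of bytes copied
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];   // holds one chunk of data
        long total = 0;
        int count;                        // how many bytes the last read returned

        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);  // write only the bytes that were read
            total += count;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // 20000 bytes in memory play the role of a remote file
        InputStream in = new ByteArrayInputStream(new byte[20000]);
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        System.out.println("Copied " + copy(in, out) + " bytes");
    }
}
```

In FileDownload you could call such a method with the InputStream of the URLConnection and the OutputStream of the local file in place of the while loop.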

I’ve prepared an image for you that’s located at http://myflex.org/yf/nyc.jpg. It’s a photo that I took in New York City some time ago. Run this program with the following two command-line arguments:

http://myflex.org/yf/nyc.jpg nyc.jpg

On my computer the console output looked like this:

Downloading from http://myflex.org/yf/nyc.jpg. Please wait...
Finished downloading the file. It's located at /Users/yfain11/IdeaProjects/jfk/Chapter12/nyc.jpg

The FileDownload program has downloaded the photo and saved it in the file nyc.jpg. This program has no knowledge of what type of data it has downloaded - it was simply reading and writing the data byte after byte. In this case it was an image, but the same program can be used for downloading audio and other files that are open to the public.

If you open the downloaded file with a program that can show images, you’ll see the following photo:

In this chapter I’ve shown you how to download a file using HTTP. Professional Java developers use various techniques, technologies, and protocols for working with remote content. If you’re interested in exploring these advanced topics on your own, the Java EE Tutorial is a good start. Maybe one day I’ll write a book about server-side programming for kids.

Project: Downloading Music

In this project you’ll use our FileDownload program to see if it can download an MP3 file as well.

Visit the Web site: http://www.last.fm/music/+free-music-downloads and pick an MP3 you like. At the time of this writing the Last.FM Web page looked as follows:
Right-click on the blue button "Free MP3", and you’ll see a browser’s popup menu. Select the menu item that allows you to copy the link address. The link will be copied in your computer’s clipboard. I’ve selected a link to a music file that looked like this:
```
http://freedownloads.last.fm/download/59565166/From%2BEmbrace%2BTo%2BEmbrace.mp3
```
In IDEA, select the menu Run | Edit Configurations and paste this link into the Program arguments field for the FileDownload program. This link will be the first command-line argument.
Move the cursor to the very end of the Program arguments field and add a space followed by the name of the local file, where you want to save the song. I’ve entered song1.mp3. Then press the button OK.

Run the program FileDownload. The MP3 file will be downloaded into the local file song1.mp3. Here’s the console output I got:

Downloading from http://freedownloads.last.fm/download/59565166/From%2BEmbrace%2BTo%2BEmbrace.mp3. Please wait...
Finished downloading the file. It's located at /Users/yfain11/IdeaProjects/jfk/Chapter12/song1.mp3

Open this file in your MP3 player and enjoy the music!
Close IDEA and repeat the same exercise from the command line. You’ll need to open a Command (or Terminal) window and change the directory to where the file FileDownload.class is located. By default, IDEA stores compiled classes in the directory out/production. My IDEA project was named Chapter12, and this is where its compiled classes were located on my computer:
```
/Users/yfain11/IdeaProjects/jfk/Chapter12/out/production/Chapter12
```
Run the FileDownload program providing the URL and the name of the local file as command-line arguments. It should download the file the same way as it did in the IDEA IDE.
