Apache Pig


Homework Assignment #4 – Chapter 5 Apache Pig
Please complete your Chapter 5 homework by submitting your work for Exercises 5.1, 5.2, and
5.3 from your textbook. All of the necessary data files for these exercises are in
“Chapter 5 Pig Exercise Files.zip”, which is posted on Canvas under your CS 480/591
modules. You can submit screenshots in a single Microsoft Word document as proof
of your work for each of the three exercises.
For Exercise 5.1 – Using Pig Latin commands, perform the following operations:
1. Upload the two files to HDFS.
2. Load the files as investors and stock_prices.
3. Display both files to make sure they are loaded correctly.
4. Join the two relations (investors and stock_prices) by stock symbol.
5. Display the joined file.
6. Group the joined file by the “last name” of the investors and display the results.
7. Calculate the total shares (simply the sum of shares among all stocks) and display the
results.
8. Calculate the total dollar amount that each investor has invested (shares of each stock
multiplied by that stock's price) and display the results.
9. Filter the top two investors who have invested the most and display the results.
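To make the expected results of steps 4 through 9 concrete, here is a minimal sketch in plain
Python rather than Pig Latin. The sample rows and the field layout (last name, symbol, shares
for investors; symbol, price for stock prices) are made-up assumptions, since the real contents
of the data files are not shown here. Step 7 is read as a per-investor total, which is also an
assumption. The comments name the Pig Latin operator that plays the same role in your script.

```python
from collections import defaultdict

# Hypothetical sample rows; the real data files may use a different layout.
investors = [  # assumed fields: (last_name, symbol, shares)
    ("Smith", "AAPL", 10),
    ("Smith", "MSFT", 5),
    ("Jones", "AAPL", 20),
    ("Lee", "MSFT", 8),
]
stock_prices = [  # assumed fields: (symbol, price)
    ("AAPL", 100.0),
    ("MSFT", 50.0),
]

# Step 4: inner join on stock symbol (the role of JOIN ... BY in Pig Latin).
price_by_symbol = dict(stock_prices)
joined = [
    (last_name, symbol, shares, price_by_symbol[symbol])
    for last_name, symbol, shares in investors
    if symbol in price_by_symbol
]

# Step 6: group the joined rows by last name (the role of GROUP ... BY).
grouped = defaultdict(list)
for row in joined:
    grouped[row[0]].append(row)

# Step 7: total shares per investor (the role of SUM inside FOREACH ... GENERATE).
total_shares = {name: sum(r[2] for r in rows) for name, rows in grouped.items()}

# Step 8: total dollars per investor, summing shares * price over each group.
total_invested = {name: sum(r[2] * r[3] for r in rows) for name, rows in grouped.items()}

# Step 9: keep the two largest totals (the role of ORDER ... DESC followed by LIMIT 2).
top_two = sorted(total_invested.items(), key=lambda kv: kv[1], reverse=True)[:2]

print(total_shares)    # {'Smith': 15, 'Jones': 20, 'Lee': 8}
print(total_invested)  # {'Smith': 1250.0, 'Jones': 2000.0, 'Lee': 400.0}
print(top_two)         # [('Jones', 2000.0), ('Smith', 1250.0)]
```

Your screenshots should still show the Pig Latin commands and their output; the sketch is only a
way to check by hand that the numbers you get are plausible.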
For Exercise 5.2 – Use the same files as in Exercise 5.1 (investor.txt and stockprice.txt) to
perform the following actions:
1. Upload the two files to HDFS (if not uploaded before).
2. Load the files as investors and stock_prices.
3. Display both files to make sure they are loaded correctly.
4. Display the structure (schema) of the investors and stock_prices relations.
5. Run explain on stock_prices and observe the logical, physical, and MapReduce
execution plans.
6. Group investors by the stock symbols they have purchased and display the results.
7. Combine both relations with a union command and display the results.
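Steps 6 and 7 can be pictured with the minimal plain-Python sketch below. It is not Pig Latin,
and the sample rows and field layout are made-up assumptions. Steps 4 and 5 (Pig's DESCRIBE
and EXPLAIN) have no plain-Python counterpart and are left out. Note that a union simply puts
the rows of both relations into one relation; it does not match rows the way a join does, and
it does not remove duplicates.

```python
from collections import defaultdict

# Hypothetical sample rows; the real data files may use a different layout.
investors = [  # assumed fields: (last_name, symbol, shares)
    ("Smith", "AAPL", 10),
    ("Jones", "AAPL", 20),
    ("Lee", "MSFT", 8),
]
stock_prices = [  # assumed fields: (symbol, price)
    ("AAPL", 100.0),
    ("MSFT", 50.0),
]

# Step 6: group investor rows by stock symbol (the role of GROUP ... BY).
by_symbol = defaultdict(list)
for row in investors:
    by_symbol[row[1]].append(row)

# Step 7: union keeps every row from both relations, even with different fields.
combined = investors + stock_prices

for symbol, rows in by_symbol.items():
    print(symbol, rows)
print(len(combined))  # 5
```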
For Exercise 5.3 – Use the text excerpt shown in Figure 5-31.
Use Pig Latin operators to solve the word count problem (see Tutorial Part C in the
PowerPoint lecture for Apache Pig that we went through in class) and find the frequency of
each word in the following text file.
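As a reminder of what a word count produces, here is a minimal plain-Python sketch. It is not
Pig Latin. The sample sentence is a made-up stand-in for the textbook excerpt, and splitting
on whitespace only (no punctuation or case handling) is a simplifying assumption. In a Pig
script the same three ideas appear as splitting lines into words (TOKENIZE with FLATTEN),
grouping identical words (GROUP ... BY), and counting each group (COUNT).

```python
from collections import Counter

# Hypothetical stand-in text; replace with the contents of the exercise file.
text = "pig latin makes pig scripts short and pig scripts readable"

def word_count(text):
    """Split on whitespace and count how often each word occurs."""
    words = text.split()   # one token per whitespace-separated word
    return Counter(words)  # groups identical words and counts each group

counts = word_count(text)
print(counts.most_common(2))  # [('pig', 3), ('scripts', 2)]
```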