1. What is Apache Pig primarily used for in Hadoop?
Answer:
Explanation:
Apache Pig is a platform for analyzing large data sets. It provides a high-level abstraction over MapReduce through its own scripting language, Pig Latin.
2. In which language are Pig scripts written?
Answer:
Explanation:
Pig scripts are written in Pig Latin, a high-level data flow language that offers a rich set of data types and operators for performing various data operations.
3. What is the primary advantage of using Pig over traditional MapReduce?
Answer:
Explanation:
The primary advantage of using Pig over traditional MapReduce is its lower learning curve: Pig Latin abstracts away the Java MapReduce programming model and is far simpler to write.
4. In Pig, which of the following is a complex data type?
Answer:
Explanation:
In Pig, 'map' is a complex data type. Others like int, float, and chararray are simple or primitive data types.
5. Which operation does the 'GROUP' command perform in Pig?
Answer:
Explanation:
The 'GROUP' command in Pig is used to group data in one or more relations by one or more fields.
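For example (the file path and field names here are hypothetical):

```pig
-- group sales records by store; each output tuple holds the group key
-- and a bag of all matching sales tuples
sales = LOAD 'sales.csv' USING PigStorage(',') AS (store:chararray, amount:int);
by_store = GROUP sales BY store;
```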
6. What does the 'LOAD' function do in Pig?
Answer:
Explanation:
The 'LOAD' function in Pig is used to read data from the file system (such as HDFS) into a relation for processing.
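A minimal sketch, assuming tab-delimited input at a hypothetical HDFS path:

```pig
-- load data into a relation with a declared schema
users = LOAD '/data/users.tsv' USING PigStorage('\t')
        AS (id:int, name:chararray, age:int);
```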
7. What is a Bag in Pig Latin?
Answer:
Explanation:
In Pig Latin, a Bag is a complex data type that represents a collection of tuples which can have duplicate elements.
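For illustration, a bag can be declared in a schema as a collection of tuples (names here are hypothetical):

```pig
-- B is a bag field whose elements are (x, y) tuples;
-- an example value is {(1,2), (1,2), (3,4)} -- duplicates are allowed
data = LOAD 'input' AS (id:int, B:{t:(x:int, y:int)});
```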
8. How does Pig interact with Hadoop's MapReduce?
Answer:
Explanation:
Pig translates Pig Latin scripts into a series of MapReduce jobs, which are then run on a Hadoop cluster.
9. Which of the following best describes a Tuple in Pig?
Answer:
Explanation:
In Pig, a Tuple is an ordered set of fields, which can be of different data types. It represents a single row in a relation.
10. What is the function of the 'FOREACH … GENERATE' statement in Pig?
Answer:
Explanation:
The 'FOREACH … GENERATE' statement in Pig is used to iterate over each tuple in a relation and transform it into a new tuple, typically to project or derive fields.
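For example (relation and field names are hypothetical):

```pig
raw = LOAD 'emp.csv' USING PigStorage(',') AS (name:chararray, salary:double);
-- project the name and derive a new field from salary
scaled = FOREACH raw GENERATE name, salary / 1000.0 AS salary_k;
```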
11. What role does the 'FILTER' command play in Pig?
Answer:
Explanation:
The 'FILTER' command in Pig is used to select tuples in a dataset that meet a specified condition.
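For example (hypothetical names):

```pig
users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- keep only tuples whose age field is 18 or more
adults = FILTER users BY age >= 18;
```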
12. What is UDF in the context of Pig?
Answer:
Explanation:
In Pig, UDF stands for User Defined Function. UDFs allow users to write custom functions to extend Pig's functionality.
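A typical usage sketch, assuming a hypothetical jar myudfs.jar containing a Java UDF class UPPER:

```pig
-- make the UDF jar available, then call the function like a built-in
REGISTER myudfs.jar;
names = LOAD 'users.txt' AS (name:chararray);
upper_names = FOREACH names GENERATE myudfs.UPPER(name);
```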
13. Which command is used to view the schema of a relation in Pig?
Answer:
Explanation:
The 'DESCRIBE' command in Pig is used to view the schema of a relation, showing the names and data types of its fields.
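For example (hypothetical relation):

```pig
users = LOAD 'users.tsv' AS (id:int, name:chararray);
-- prints the relation's schema, e.g. users: {id: int,name: chararray}
DESCRIBE users;
```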
14. How are Pig Latin scripts typically executed?
Answer:
Explanation:
Pig Latin scripts are typically executed in a Hadoop cluster. They are translated into MapReduce jobs that run on the cluster.
15. Which data model does Pig primarily use?
Answer:
Explanation:
Pig primarily uses a nested data model: a relation is a bag of tuples, and a tuple's fields may themselves be complex types such as tuples, bags, and maps. This contrasts with the flat rows of a traditional relational database.
16. What is the main difference between the 'STORE' and 'DUMP' commands in Pig?
Answer:
Explanation:
The 'STORE' command in Pig is used to write data from a relation to the file system (like HDFS), whereas 'DUMP' displays the contents of a relation to the screen for viewing.
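Side by side (paths and names hypothetical):

```pig
-- DUMP: print the relation's tuples to the console (useful for debugging)
DUMP results;
-- STORE: write the relation to the file system, here comma-delimited
STORE results INTO '/out/results' USING PigStorage(',');
```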
17. What is Pig's execution environment called?
Answer:
Explanation:
The Grunt shell is the interactive command line interface for running Pig scripts and commands.
18. What is the significance of a 'JOIN' operation in Pig?
Answer:
Explanation:
The 'JOIN' operation in Pig is used to combine two or more datasets based on a common field, similar to the JOIN operation in SQL.
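For example, an inner join on a shared customer id (hypothetical data):

```pig
orders    = LOAD 'orders.csv' USING PigStorage(',') AS (oid:int, cid:int);
customers = LOAD 'cust.csv'   USING PigStorage(',') AS (cid:int, name:chararray);
-- combine tuples from both relations where the cid fields match
joined = JOIN orders BY cid, customers BY cid;
```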
19. What does 'COGROUP' do in Pig Latin?
Answer:
Explanation:
The 'COGROUP' operation in Pig is used to group two or more relations by a common field, creating a new relation where each group is a tuple containing the common field and bags of tuples from each relation.
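For example (hypothetical orders and customers relations):

```pig
orders    = LOAD 'orders.csv' USING PigStorage(',') AS (oid:int, cid:int);
customers = LOAD 'cust.csv'   USING PigStorage(',') AS (cid:int, name:chararray);
-- each output tuple: (cid, {bag of matching orders}, {bag of matching customers})
grouped = COGROUP orders BY cid, customers BY cid;
```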
20. How can Pig scripts be optimized for performance?
Answer:
Explanation:
Performance of Pig scripts can be optimized by choosing efficient data types, minimizing data skew, and using operations that reduce the amount of data processed and transferred across the network.
21. What does the 'SPLIT' command do in Pig?
Answer:
Explanation:
The 'SPLIT' command in Pig is used to split a single dataset into two or more relations based on specified conditions.
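For example (hypothetical relation):

```pig
users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- route each tuple into one of two relations based on a condition
SPLIT users INTO minors IF age < 18, adults IF age >= 18;
```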
22. What is the primary use of the 'UNION' operation in Pig?
Answer:
Explanation:
The 'UNION' operation in Pig is used to combine two or more datasets into one dataset, concatenating their records.
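For example (hypothetical relations with matching schemas):

```pig
logs_a = LOAD 'logs_2023.csv' USING PigStorage(',') AS (ts:long, msg:chararray);
logs_b = LOAD 'logs_2024.csv' USING PigStorage(',') AS (ts:long, msg:chararray);
-- concatenate the records of both relations (order is not guaranteed)
all_logs = UNION logs_a, logs_b;
```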
23. Which of the following is a correct use of the 'LIMIT' operator in Pig?
Answer:
Explanation:
The 'LIMIT' operator in Pig is used to restrict the output to a specified number of tuples, effectively limiting the size of the result.
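For example, taking the ten oldest users (hypothetical relation):

```pig
users  = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
sorted = ORDER users BY age DESC;
-- keep at most 10 tuples
top10  = LIMIT sorted 10;
```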
24. In Pig, what is the role of the 'DISTINCT' operator?
Answer:
Explanation:
The 'DISTINCT' operator in Pig is used to remove duplicate records from a data set, ensuring that each tuple in the output is unique.
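For example (hypothetical relation):

```pig
users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- remove duplicate tuples (entire tuples must match to count as duplicates)
unique_users = DISTINCT users;
```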
25. How does Pig handle null values in its operations?
Answer:
Explanation:
Pig is designed to handle null values gracefully. It supports operations on null values, treating them distinctly from other values, and provides functions to deal with nulls effectively.
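For example, nulls can be tested explicitly (hypothetical relation):

```pig
users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- drop tuples where age is null; comparisons against null are themselves null
known_ages = FILTER users BY age IS NOT NULL;
```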