How to configure a self managed Spark Datasource

This article is a placeholder for comments on:

Please comment +1 if this How to is important to you.

I am looking for documentation on how to use a Spark datasource when the files are stored remotely (for example, in S3).
I found the two articles below. Is the first article about using local Spark for validation, while the second method uses a Spark cluster? If I want the validation to scale to larger amounts of data, should I follow the second article? Is it true that the first article's method is only useful for validating small amounts of data, since it appears to use local Spark and will involve downloading data from the remote location during validation?

  1. How to configure a PySpark datasource for accessing the data from AWS S3?
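For context, a Spark-on-S3 datasource along the lines of the first article might be declared in `great_expectations.yml` roughly as sketched below. This is an illustrative fragment only, assuming the v2-style Great Expectations API; the datasource name, bucket, prefix, and asset names are hypothetical, and exact keys may differ between Great Expectations versions:

```yaml
# Hypothetical excerpt from great_expectations.yml (v2-style API).
# All names, buckets, and prefixes here are illustrative assumptions.
datasources:
  my_spark_s3_datasource:
    class_name: SparkDFDatasource
    module_name: great_expectations.datasource
    data_asset_type:
      class_name: SparkDFDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      s3_parquet:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: my-example-bucket      # hypothetical bucket
        reader_method: parquet
        assets:
          events:
            prefix: data/events/       # hypothetical key prefix
```

Note that whether validation runs on local Spark or on a cluster is not decided by this datasource block: it depends on how the underlying `SparkSession` is created (for example, its `master` setting or the environment it is launched in), which is why the two articles can share similar datasource configuration while differing in scalability.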