Apache Spark

The Hadoop revolution came with a great distributed data-processing and compute framework, MapReduce and later MapReduce v2 (YARN), built on top of HDFS. Nevertheless, it has some drawbacks. For example, it writes and reads intermediate job results to and from disk so heavily that overall performance degrades to raw disk performance, especially when the underlying disks are slow spinning drives.

To solve this problem, one person came along and revolutionized data processing and computation on Hadoop. His name is Matei Zaharia. He found a new way to speed up the MapReduce infrastructure: offloading the slow parts into RAM (random-access memory) and serializing a job's execution plan as a DAG (directed acyclic graph), which makes it possible to monitor the job and safely re-execute it, or just its failed parts.
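To make those two ideas concrete, here is a minimal PySpark sketch (an illustration, not from the original walkthrough) showing how lazy transformations only build up the DAG, while cache() keeps an intermediate result in RAM so that later actions do not recompute it:

from pyspark.sql import SparkSession

# A local session is enough to illustrate the idea; on a cluster you would
# use .master("yarn") instead, as in the examples later in this article.
spark = SparkSession.builder.master("local[*]").appName("DagCacheSketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1000000))

# Transformations are lazy: they only extend the DAG (the lineage); nothing runs yet.
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# cache() asks Spark to keep this intermediate result in RAM,
# so later actions reuse it instead of recomputing it from scratch.
even_squares.cache()

# Actions trigger execution of the DAG; the second action is served from the cache.
print(even_squares.count())
print(even_squares.take(5))

# The lineage Spark keeps for monitoring and for re-executing lost partitions:
print(even_squares.toDebugString().decode())

spark.stop()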

Spark shone so brightly that it was quickly picked up by the Apache community, where development of the core continues to this day. In parallel, Matei and his friends founded Databricks to build cloud applications on top of Apache Spark for all kinds of business needs in distributed data processing and computation.

In this second part of the Poleposition Core Components series, I will introduce you to open-source Apache Spark on the Poleposition Hadoop platform.

First of all, if you are not familiar with Poleposition, I would recommend giving it a try by downloading the Poleposition installer: https://medium.com/@beartell/an-open-source-apache-bigtop-based-hadoop-installer-7546d451e6fe

On the Poleposition platform, you can use Apache Spark for any data-processing and compute task that you cannot handle on a single machine. You can launch your job from your desktop computer, from an edge-server node, or directly inside the Poleposition cluster itself.

On every node, without exception, you will find all of the Apache Spark tools available on Poleposition. The two most commonly used ones are:

spark-submit
spark-shell

You can submit any job to the Poleposition platform by invoking spark-submit, whether it is a Python-, Scala-, or Java-based Spark application.
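As an illustration, a Python-based Spark application that could be submitted this way might look like the minimal sketch below (the file name pi_estimate.py and the sample count are hypothetical, not part of the Poleposition distribution); it estimates Pi with the same Monte Carlo idea as the example jar used in the command that follows:

# pi_estimate.py -- a hypothetical minimal PySpark application
import random

from pyspark.sql import SparkSession

def inside_unit_circle(_):
    # Draw a random point in the unit square and check whether it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

if __name__ == "__main__":
    # The master is deliberately not set here: the --master flag given to
    # spark-submit decides where the application runs.
    spark = SparkSession.builder.appName("PiEstimate").getOrCreate()
    sc = spark.sparkContext

    samples = 1000000
    count = sc.parallelize(range(samples), 10).map(inside_unit_circle).sum()
    print("Pi is roughly %f" % (4.0 * count / samples))

    spark.stop()

Submitting it would then be as simple as spark-submit --master yarn --deploy-mode cluster pi_estimate.py.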

By default, the Poleposition YARN scheduler is used to run submitted Apache Spark jobs, so we need to put the master=yarn setting into the spark-submit command, as shown below:

spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master yarn \
   --deploy-mode cluster \
   --executor-memory 2G \
   --num-executors 5 \
   /usr/lib/spark/examples/jars/spark-examples_2.12-3.1.2.jar \
   1000

This toy job computes Pi across the whole Poleposition cluster, using YARN as the job scheduler.

[vagrant@bds02 ~]$ spark-submit   --class org.apache.spark.examples.SparkPi   --master yarn   --deploy-mode cluster   --executor-memory 2G   --num-executors 5   /usr/lib/spark/examples/jars/spark-examples_2.12-3.1.2.jar   1000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-reload4j-1.7.35.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.35.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
22/12/27 07:05:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/12/27 07:05:14 INFO RMProxy: Connecting to ResourceManager at bds01.beartell.com/192.168.56.10:8032
22/12/27 07:05:14 INFO Client: Requesting a new application from cluster with 5 NodeManagers
22/12/27 07:05:15 INFO Configuration: resource-types.xml not found
22/12/27 07:05:15 INFO ResourceUtils: Unable to find 'resource-types.xml'.
22/12/27 07:05:15 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
22/12/27 07:05:15 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
22/12/27 07:05:15 INFO Client: Setting up container launch context for our AM
22/12/27 07:05:15 INFO Client: Setting up the launch environment for our AM container
22/12/27 07:05:15 INFO Client: Preparing resources for our AM container
22/12/27 07:05:15 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/12/27 07:05:15 INFO Client: Uploading resource file:/tmp/spark-65417959-39e2-4d97-9f7f-773b50c75034/__spark_libs__7106463166155452428.zip -> hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002/__spark_libs__7106463166155452428.zip
22/12/27 07:05:16 INFO Client: Uploading resource file:/usr/lib/spark/examples/jars/spark-examples_2.12-3.1.2.jar -> hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002/spark-examples_2.12-3.1.2.jar
22/12/27 07:05:16 INFO Client: Uploading resource file:/tmp/spark-65417959-39e2-4d97-9f7f-773b50c75034/__spark_conf__4652289559033692147.zip -> hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002/__spark_conf__.zip
22/12/27 07:05:16 INFO SecurityManager: Changing view acls to: vagrant
22/12/27 07:05:16 INFO SecurityManager: Changing modify acls to: vagrant
22/12/27 07:05:16 INFO SecurityManager: Changing view acls groups to: 
22/12/27 07:05:16 INFO SecurityManager: Changing modify acls groups to: 
22/12/27 07:05:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(vagrant); groups with view permissions: Set(); users  with modify permissions: Set(vagrant); groups with modify permissions: Set()
22/12/27 07:05:16 INFO Client: Submitting application application_1672123666896_0002 to ResourceManager
22/12/27 07:05:16 INFO YarnClientImpl: Submitted application application_1672123666896_0002
22/12/27 07:05:17 INFO Client: Application report for application_1672123666896_0002 (state: ACCEPTED)
22/12/27 07:05:17 INFO Client: 
  client token: N/A
  diagnostics: AM container is launched, waiting for AM container to Register with RM
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 1672124716628
  final status: UNDEFINED
  tracking URL: http://bds01.beartell.com:20888/proxy/application_1672123666896_0002/
  user: vagrant
22/12/27 07:05:18 INFO Client: Application report for application_1672123666896_0002 (state: ACCEPTED)
22/12/27 07:05:19 INFO Client: Application report for application_1672123666896_0002 (state: ACCEPTED)
22/12/27 07:05:20 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING)
22/12/27 07:05:20 INFO Client: 
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: bds02.beartell.com
  ApplicationMaster RPC port: 37857
  queue: default
  start time: 1672124716628
  final status: UNDEFINED
  tracking URL: http://bds01.beartell.com:20888/proxy/application_1672123666896_0002/
  user: vagrant
22/12/27 07:05:21 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING)
22/12/27 07:05:22 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING)
22/12/27 07:05:23 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING)
22/12/27 07:05:24 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING)
22/12/27 07:05:25 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING)
22/12/27 07:05:26 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING)
22/12/27 07:05:27 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING)
22/12/27 07:05:28 INFO Client: Application report for application_1672123666896_0002 (state: FINISHED)
22/12/27 07:05:28 INFO Client: 
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: bds02.beartell.com
  ApplicationMaster RPC port: 37857
  queue: default
  start time: 1672124716628
  final status: SUCCEEDED
  tracking URL: http://bds01.beartell.com:20888/proxy/application_1672123666896_0002/
  user: vagrant
22/12/27 07:05:28 INFO ShutdownHookManager: Shutdown hook called
22/12/27 07:05:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-65417959-39e2-4d97-9f7f-773b50c75034
22/12/27 07:05:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-2be29757-a56b-4913-9aad-92109e9b2a9b

At the end of the run, it reports the job's result through the YARN application monitor, the so-called "tracking URL". The tracking URL is used to follow YARN jobs and to see whether they returned an error or not.

http://bds01.beartell.com:20888/proxy/application_1672123666896_0002
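Besides opening the tracking URL in a browser, the application state can also be polled programmatically through the YARN ResourceManager REST API. The short sketch below is an illustration, not part of the original walkthrough; it assumes the ResourceManager web UI is reachable on its default port 8088 on bds01.beartell.com:

import requests

RM = "http://bds01.beartell.com:8088"          # ResourceManager web UI (default port, assumed)
APP_ID = "application_1672123666896_0002"      # the application id printed by spark-submit

# GET /ws/v1/cluster/apps/{appid} returns a JSON document describing the application.
resp = requests.get("{}/ws/v1/cluster/apps/{}".format(RM, APP_ID), timeout=10)
resp.raise_for_status()
app = resp.json()["app"]

print(app["state"], app["finalStatus"])        # e.g. FINISHED SUCCEEDED
print(app["trackingUrl"])                      # the same tracking URL as above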

If you click the "Logs" link on the tracking URL page, it shows you the stdout, stderr, and other log outputs for further analysis of the application. For our toy job, the computed value of Pi appears in the stdout log. Let's check it below.

 Log Type: directory.info

Log Upload Time: Tue Dec 27 07:05:29 +0000 2022

Log Length: 32867

Showing 4096 bytes of 32867 total. Click here for the full log.

-x------   1  yarn     yarn        52684 Dec 27 07:05 ./__spark_libs__/spark-repl_2.12-3.1.2.jar
 43059925     32 -r-x------   1  yarn     yarn        30496 Dec 27 07:05 ./__spark_libs__/spark-sketch_2.12-3.1.2.jar
 43059926   7372 -r-x------   1  yarn     yarn      7547617 Dec 27 07:05 ./__spark_libs__/spark-sql_2.12-3.1.2.jar
 43059927     16 -r-x------   1  yarn     yarn        15071 Dec 27 07:05 ./__spark_libs__/transaction-api-1.1.jar
 43059928   1116 -r-x------   1  yarn     yarn      1141331 Dec 27 07:05 ./__spark_libs__/spark-streaming_2.12-3.1.2.jar
 43059929     16 -r-x------   1  yarn     yarn        15155 Dec 27 07:05 ./__spark_libs__/spark-tags_2.12-3.1.2.jar
 43059930     52 -r-x------   1  yarn     yarn        51457 Dec 27 07:05 ./__spark_libs__/spark-unsafe_2.12-3.1.2.jar
 43059931    340 -r-x------   1  yarn     yarn       346697 Dec 27 07:05 ./__spark_libs__/spark-yarn_2.12-3.1.2.jar
 43059932    440 -r-x------   1  yarn     yarn       447005 Dec 27 07:05 ./__spark_libs__/univocity-parsers-2.9.1.jar
 43059933     80 -r-x------   1  yarn     yarn        79588 Dec 27 07:05 ./__spark_libs__/spire-macros_2.12-0.17.0-M1.jar
 50357430      0 drwx------   3  yarn     yarn          158 Dec 27 07:05 ./__spark_conf__
 50357431      4 -r-x------   1  yarn     yarn         2371 Dec 27 07:05 ./__spark_conf__/log4j.properties
 54536278      4 drwx------   2  yarn     yarn         4096 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__
 54536279      4 -r-x------   1  yarn     yarn         3130 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/mapred-site.xml
 54536280      4 -r-x------   1  yarn     yarn         3635 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/hadoop-env.sh
 54536281     16 -r-x------   1  yarn     yarn        14713 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/log4j.properties
 54536282      4 -r-x------   1  yarn     yarn         3321 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/hadoop-metrics2.properties
 54536283      4 -r-x------   1  yarn     yarn         3410 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/yarn-site.xml
 54536284      4 -r-x------   1  yarn     yarn         2421 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/core-site.xml
 54536285      4 -r-x------   1  yarn     yarn           10 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/workers
 54536286     12 -r-x------   1  yarn     yarn         9213 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/capacity-scheduler.xml
 54536287      4 -r-x------   1  yarn     yarn         1335 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/configuration.xsl
 54536288      4 -r-x------   1  yarn     yarn         2359 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/hdfs-site.xml
 54536289      8 -r-x------   1  yarn     yarn         6272 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/yarn-env.sh
 54536290      4 -r-x------   1  yarn     yarn           95 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/container-executor.cfg
 54536291      4 -r-x------   1  yarn     yarn         1764 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/mapred-env.sh
 54536292      4 -r-x------   1  yarn     yarn         2316 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/ssl-client.xml.example
 54536293     12 -r-x------   1  yarn     yarn        11392 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/hadoop-policy.xml
 54536294      4 -r-x------   1  yarn     yarn         2697 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/ssl-server.xml.example
 54536295      4 -r-x------   1  yarn     yarn          128 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/taskcontroller.cfg
 54536296      8 -r-x------   1  yarn     yarn         4113 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/mapred-queues.xml.template
 50357432    200 -r-x------   1  yarn     yarn       202750 Dec 27 07:05 ./__spark_conf__/__spark_hadoop_conf__.xml
 50357433      4 -r-x------   1  yarn     yarn          561 Dec 27 07:05 ./__spark_conf__/__spark_conf__.properties
 50357434      4 -r-x------   1  yarn     yarn          676 Dec 27 07:05 ./__spark_conf__/__spark_dist_cache__.properties
broken symlinks(find -L . -maxdepth 5 -type l -ls):


Log Type: launch_container.sh

Log Upload Time: Tue Dec 27 07:05:29 +0000 2022

Log Length: 5217

Showing 4096 bytes of 5217 total. Click here for the full log.

TTP_PORT="8042"
export LOCAL_DIRS="/data/yarn/usercache/vagrant/appcache/application_1672123666896_0002"
export LOCAL_USER_DIRS="/data/yarn/usercache/vagrant/"
export LOG_DIRS="/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001"
export USER="vagrant"
export LOGNAME="vagrant"
export HOME="/home/"
export PWD="/data/yarn/usercache/vagrant/appcache/application_1672123666896_0002/container_1672123666896_0002_01_000001"
export LOCALIZATION_COUNTERS="197404759,0,3,0,668"
export JVM_PID="$$"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
export SPARK_YARN_STAGING_DIR="hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002"
export APPLICATION_WEB_PROXY_BASE="/proxy/application_1672123666896_0002"
export SPARK_DIST_CLASSPATH="/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/./:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*"
export CLASSPATH="$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/./:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:$PWD/__spark_conf__/__hadoop_conf__"
export APP_SUBMIT_TIME_ENV="1672124716628"
export SPARK_USER="vagrant"
export PYTHONHASHSEED="0"
export MALLOC_ARENA_MAX="4"
echo "Setting up job resources"
ln -sf -- "/data/yarn/usercache/vagrant/filecache/15/spark-examples_2.12-3.1.2.jar" "__app__.jar"
ln -sf -- "/data/yarn/usercache/vagrant/filecache/13/__spark_libs__7106463166155452428.zip" "__spark_libs__"
ln -sf -- "/data/yarn/usercache/vagrant/filecache/14/__spark_conf__.zip" "__spark_conf__"
echo "Copying debugging information"
# Creating copy of launch script
cp "launch_container.sh" "/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/launch_container.sh"
chmod 640 "/var/log/hadoop-yarn/containers/application_16721236664: 0.32949933409690857
0: 0.25188517570495605
3: 0.020412154495716095
2: 0.013244825415313244
9: -0.05387714132666588896_0002/container_1672123666896_0002_01_000001/launch_container.sh"
# Determining directory contents
echo "ls -l:" 1>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info"
ls -l 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info"
echo "find -L . -maxdepth 5 -ls:" 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info"
find -L . -maxdepth 5 -ls 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info"
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info"
echo "Launching container"
exec /bin/bash -c "LD_LIBRARY_PATH=\"/usr/lib/hadoop/lib/native:$LD_LIBRARY_PATH\" $JAVA_HOME/bin/java -server -Xmx1024m -Djava.io.tmpdir=$PWD/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar file:/usr/lib/spark/examples/jars/spark-examples_2.12-3.1.2.jar --arg '1000' --properties-file $PWD/__spark_conf__/__spark_conf__.properties --dist-cache-conf $PWD/__spark_conf__/__spark_dist_cache__.properties 1> /var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/stderr"


Log Type: prelaunch.err

Log Upload Time: Tue Dec 27 07:05:29 +0000 2022

Log Length: 0


Log Type: prelaunch.out

Log Upload Time: Tue Dec 27 07:05:29 +0000 2022

Log Length: 100

Setting up env variables
Setting up job resources
Copying debugging information
Launching container


Log Type: stderr

Log Upload Time: Tue Dec 27 07:05:29 +0000 2022

Log Length: 342839

Showing 4096 bytes of 342839 total. Click here for the full log.

.0 (TID 996) (bds02.beartell.com, executor 5, partition 996, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/27 07:05:27 INFO TaskSetManager: Starting task 997.0 in stage 0.0 (TID 997) (bds03.beartell.com, executor 3, partition 997, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/27 07:05:27 INFO TaskSetManager: Finished task 990.0 in stage 0.0 (TID 990) in 7 ms on bds01.beartell.com (executor 1) (990/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 989.0 in stage 0.0 (TID 989) in 17 ms on bds02.beartell.com (executor 5) (991/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 992.0 in stage 0.0 (TID 992) in 7 ms on bds03.beartell.com (executor 3) (992/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 991.0 in stage 0.0 (TID 991) in 7 ms on bds05.beartell.com (executor 4) (993/1000)
22/12/27 07:05:27 INFO TaskSetManager: Starting task 998.0 in stage 0.0 (TID 998) (bds04.beartell.com, executor 2, partition 998, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/27 07:05:27 INFO TaskSetManager: Starting task 999.0 in stage 0.0 (TID 999) (bds01.beartell.com, executor 1, partition 999, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/27 07:05:27 INFO TaskSetManager: Finished task 994.0 in stage 0.0 (TID 994) in 8 ms on bds05.beartell.com (executor 4) (994/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 996.0 in stage 0.0 (TID 996) in 8 ms on bds02.beartell.com (executor 5) (995/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 997.0 in stage 0.0 (TID 997) in 7 ms on bds03.beartell.com (executor 3) (996/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 993.0 in stage 0.0 (TID 993) in 14 ms on bds04.beartell.com (executor 2) (997/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 995.0 in stage 0.0 (TID 995) in 8 ms on bds01.beartell.com (executor 1) (998/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 999.0 in stage 0.0 (TID 999) in 6 ms on bds01.beartell.com (executor 1) (999/1000)
22/12/27 07:05:27 INFO TaskSetManager: Finished task 998.0 in stage 0.0 (TID 998) in 10 ms on bds04.beartell.com (executor 2) (1000/1000)
22/12/27 07:05:27 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
22/12/27 07:05:27 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 3.993 s
22/12/27 07:05:27 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/12/27 07:05:27 INFO YarnClusterScheduler: Killing all running tasks in stage 0: Stage finished
22/12/27 07:05:27 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 4.082941 s
22/12/27 07:05:27 INFO SparkUI: Stopped Spark web UI at http://bds02.beartell.com:46611
22/12/27 07:05:27 INFO YarnClusterSchedulerBackend: Shutting down all executors
22/12/27 07:05:27 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
22/12/27 07:05:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/12/27 07:05:27 INFO MemoryStore: MemoryStore cleared
22/12/27 07:05:27 INFO BlockManager: BlockManager stopped
22/12/27 07:05:27 INFO BlockManagerMaster: BlockManagerMaster stopped
22/12/27 07:05:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/12/27 07:05:27 INFO SparkContext: Successfully stopped SparkContext
22/12/27 07:05:27 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/12/27 07:05:27 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
22/12/27 07:05:27 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
22/12/27 07:05:27 INFO ApplicationMaster: Deleting staging directory hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002
22/12/27 07:05:27 INFO ShutdownHookManager: Shutdown hook called
22/12/27 07:05:27 INFO ShutdownHookManager: Deleting directory /data/yarn/usercache/vagrant/appcache/application_1672123666896_0002/spark-2e058340-f7fd-4c93-ba63-d942ef271166


Log Type: stdout

Log Upload Time: Tue Dec 27 07:05:29 +0000 2022

Log Length: 31

Pi is roughly 3.14183903141839

Besides spark-submit, you can also use Apache Spark programmatically in your own Python application: obtain a Spark session, submit interactive compute workloads to the Poleposition cluster, and run machine-learning workflows, as shown below:

from pyspark.sql import SparkSession
from pyspark.mllib.feature import Word2Vec

if __name__ == "__main__":

    # Build a Spark session that runs against the Poleposition YARN cluster.
    spark = SparkSession\
        .builder\
        .config("spark.hadoop.yarn.resourcemanager.address", "bds01.beartell.com:8032")\
        .config("spark.yarn.stagingDir", "/tmp/")\
        .master("yarn")\
        .appName("Word2VecExample")\
        .getOrCreate()
    sc = spark.sparkContext

    # Read the sample text file and split each line into words.
    inp = sc.textFile("sample_lda_data.txt").map(lambda row: row.split(" "))

    # Train a Word2Vec model on the tokenized input.
    word2vec = Word2Vec()
    model = word2vec.fit(inp)

    # Find the five words closest (by cosine similarity) to the token '1'.
    synonyms = model.findSynonyms('1', 5)

    for word, cosine_distance in synonyms:
        print("{}: {}".format(word, cosine_distance))

    sc.stop()

In this case, we launch the Spark application from the terminal and view its output on that same terminal.

[vagrant@bds02 ~]$ python3 Word2vec.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-12-27 08:23:05,585 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-12-27 08:23:06,673 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
/usr/local/lib/python3.6/site-packages/pyspark/context.py:238: FutureWarning: Python 3.6 support is deprecated in Spark 3.2.
  FutureWarning
2022-12-27 08:23:17,831 WARN netlib.InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
2022-12-27 08:23:17,835 WARN netlib.InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
4: 0.32949933409690857
0: 0.25188517570495605
3: 0.020412154495716095
2: 0.013244825415313244
9: -0.05387714132666588

This is the end of Part II. I hope you enjoyed this article, and MERRY CHRISTMAS to everyone!