Apache Spark
The Hadoop revolution came with a great distributed data processing and computation framework, MapReduce and MapReduce v2 (YARN), built on top of HDFS. Still, it has some drawbacks. For example, MapReduce writes and reads intermediate job results to and from disk so heavily that job performance is bound by raw disk throughput, which hurts badly when the underlying disks are slow spinning drives.
To solve this problem, one person came along and revolutionized data processing and computation on Hadoop. His name is Matei Zaharia. He found a new way to speed up the MapReduce infrastructure: offloading the slow parts into RAM (random access memory) and serializing a job's execution plan as a DAG (directed acyclic graph), which makes it possible to monitor jobs and safely re-execute them, including just their failed parts.
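To make the idea concrete, here is a minimal PySpark sketch (illustrative only, not tied to any particular cluster) showing both ingredients: an intermediate result kept in RAM via cache(), and a lineage DAG that is only executed when an action is called.

from pyspark.sql import SparkSession

# Minimal sketch: transformations are only recorded as a lineage DAG;
# nothing actually runs until an action is invoked.
spark = SparkSession.builder.master("local[*]").appName("DagDemo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1000000))
squares = numbers.map(lambda x: x * x).cache()  # keep this intermediate RDD in RAM

print(squares.sum())    # first action: executes the DAG and fills the cache
print(squares.count())  # second action: served from memory, no recomputation

spark.stop()

If an executor dies, Spark walks the DAG backwards and recomputes only the lost partitions instead of re-running the whole job.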
Spark shone so brightly that it was immediately picked up by the folks at Apache, where development of the core continues to this day. In parallel, Matei and his friends founded Databricks to build cloud applications on top of Apache Spark for all kinds of business needs in distributed data processing and computation.
In this second part of the Poleposition Core Components series, I will introduce you to open-source Apache Spark on the Poleposition Hadoop platform.
First of all, if you are not familiar with Poleposition, I would recommend giving it a try by downloading the Poleposition installer: https://medium.com/@beartell/an-open-source-apache-bigtop-based-hadoop-installer-7546d451e6fe
On the Poleposition platform you can use Apache Spark for any data processing and computation task that you cannot handle on a single machine. You can launch your jobs from your desktop, from an edge server node, or directly inside the Poleposition cluster itself.
On all nodes, without exception, you will find the full set of Apache Spark tools available on Poleposition. The two most common ones are listed below (a short interactive-shell sketch follows the list):
spark-submit
spark-shell
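spark-shell opens an interactive Scala REPL against the cluster; pyspark, its Python counterpart, ships alongside it and behaves the same way. A minimal interactive session might look like the sketch below (an assumption for illustration: the shell is started with pyspark --master yarn, and inside it a SparkSession is already bound to the name spark):

# Started with: pyspark --master yarn
# The shell pre-creates `spark` (SparkSession) and `sc` (SparkContext).
rdd = spark.sparkContext.parallelize(range(100))
evens = rdd.filter(lambda x: x % 2 == 0)
print(evens.count())  # -> 50, computed on the cluster's executors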
You can submit any job to the Poleposition platform by invoking spark-submit, whether it is a Python-, Scala-, or Java-based Spark application.
By default, the Poleposition YARN scheduler is used to run submitted Apache Spark jobs, so we need to pass the master=yarn setting to the spark-submit command, as shown below:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 5 /usr/lib/spark/examples/jars/spark-examples_2.12-3.1.2.jar 1000
This toy job computes Pi across the whole Poleposition cluster, with YARN acting as the job scheduler.
[vagrant@bds02 ~]$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 5 /usr/lib/spark/examples/jars/spark-examples_2.12-3.1.2.jar 1000 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-reload4j-1.7.35.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.35.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 22/12/27 07:05:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/12/27 07:05:14 INFO RMProxy: Connecting to ResourceManager at bds01.beartell.com/192.168.56.10:8032 22/12/27 07:05:14 INFO Client: Requesting a new application from cluster with 5 NodeManagers 22/12/27 07:05:15 INFO Configuration: resource-types.xml not found 22/12/27 07:05:15 INFO ResourceUtils: Unable to find 'resource-types.xml'. 22/12/27 07:05:15 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 22/12/27 07:05:15 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead 22/12/27 07:05:15 INFO Client: Setting up container launch context for our AM 22/12/27 07:05:15 INFO Client: Setting up the launch environment for our AM container 22/12/27 07:05:15 INFO Client: Preparing resources for our AM container 22/12/27 07:05:15 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 
22/12/27 07:05:15 INFO Client: Uploading resource file:/tmp/spark-65417959-39e2-4d97-9f7f-773b50c75034/__spark_libs__7106463166155452428.zip -> hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002/__spark_libs__7106463166155452428.zip 22/12/27 07:05:16 INFO Client: Uploading resource file:/usr/lib/spark/examples/jars/spark-examples_2.12-3.1.2.jar -> hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002/spark-examples_2.12-3.1.2.jar 22/12/27 07:05:16 INFO Client: Uploading resource file:/tmp/spark-65417959-39e2-4d97-9f7f-773b50c75034/__spark_conf__4652289559033692147.zip -> hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002/__spark_conf__.zip 22/12/27 07:05:16 INFO SecurityManager: Changing view acls to: vagrant 22/12/27 07:05:16 INFO SecurityManager: Changing modify acls to: vagrant 22/12/27 07:05:16 INFO SecurityManager: Changing view acls groups to: 22/12/27 07:05:16 INFO SecurityManager: Changing modify acls groups to: 22/12/27 07:05:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vagrant); groups with view permissions: Set(); users with modify permissions: Set(vagrant); groups with modify permissions: Set() 22/12/27 07:05:16 INFO Client: Submitting application application_1672123666896_0002 to ResourceManager 22/12/27 07:05:16 INFO YarnClientImpl: Submitted application application_1672123666896_0002 22/12/27 07:05:17 INFO Client: Application report for application_1672123666896_0002 (state: ACCEPTED) 22/12/27 07:05:17 INFO Client: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1672124716628 final status: UNDEFINED tracking URL: http://bds01.beartell.com:20888/proxy/application_1672123666896_0002/ user: vagrant 22/12/27 07:05:18 INFO Client: Application report for application_1672123666896_0002 (state: ACCEPTED) 22/12/27 07:05:19 INFO Client: Application report for application_1672123666896_0002 (state: ACCEPTED) 22/12/27 07:05:20 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING) 22/12/27 07:05:20 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: bds02.beartell.com ApplicationMaster RPC port: 37857 queue: default start time: 1672124716628 final status: UNDEFINED tracking URL: http://bds01.beartell.com:20888/proxy/application_1672123666896_0002/ user: vagrant 22/12/27 07:05:21 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING) 22/12/27 07:05:22 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING) 22/12/27 07:05:23 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING) 22/12/27 07:05:24 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING) 22/12/27 07:05:25 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING) 22/12/27 07:05:26 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING) 22/12/27 07:05:27 INFO Client: Application report for application_1672123666896_0002 (state: RUNNING) 22/12/27 07:05:28 INFO Client: Application report for application_1672123666896_0002 (state: FINISHED) 22/12/27 07:05:28 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: bds02.beartell.com ApplicationMaster RPC port: 37857 queue: default 
start time: 1672124716628 final status: SUCCEEDED tracking URL: http://bds01.beartell.com:20888/proxy/application_1672123666896_0002/ user: vagrant 22/12/27 07:05:28 INFO ShutdownHookManager: Shutdown hook called 22/12/27 07:05:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-65417959-39e2-4d97-9f7f-773b50c75034 22/12/27 07:05:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-2be29757-a56b-4913-9aad-92109e9b2a9b
At the end of the run, it reports the job result through the YARN application monitor, via the so-called tracking URL. This URL can be used to follow any YARN job, whether it finishes with an error or not.
http://bds01.beartell.com:20888/proxy/application_1672123666896_0002
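Besides opening the tracking URL in a browser, you can also query an application's status programmatically. Here is a minimal sketch using the YARN ResourceManager REST API; it assumes the ResourceManager web UI listens on the Hadoop default port 8088 (your cluster may differ, note that the proxy above runs on 20888):

import json
import urllib.request

# Query the YARN ResourceManager REST API for one application's report.
# Port 8088 is an assumption (the Hadoop default); adjust to your cluster.
app_id = "application_1672123666896_0002"
url = "http://bds01.beartell.com:8088/ws/v1/cluster/apps/{}".format(app_id)

with urllib.request.urlopen(url) as resp:
    app = json.load(resp)["app"]

print(app["state"], app["finalStatus"], app["trackingUrl"])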
If you click the Logs link on that page, it will show you stdout, stderr, and the other per-log-level outputs for further analysis of the application. For our toy job, the computed value of Pi shows up in the stdout log. Let's check it below.
Log Type: directory.info Log Upload Time: Tue Dec 27 07:05:29 +0000 2022 Log Length: 32867 Showing 4096 bytes of 32867 total. Click here for the full log. -x------ 1 yarn yarn 52684 Dec 27 07:05 ./__spark_libs__/spark-repl_2.12-3.1.2.jar 43059925 32 -r-x------ 1 yarn yarn 30496 Dec 27 07:05 ./__spark_libs__/spark-sketch_2.12-3.1.2.jar 43059926 7372 -r-x------ 1 yarn yarn 7547617 Dec 27 07:05 ./__spark_libs__/spark-sql_2.12-3.1.2.jar 43059927 16 -r-x------ 1 yarn yarn 15071 Dec 27 07:05 ./__spark_libs__/transaction-api-1.1.jar 43059928 1116 -r-x------ 1 yarn yarn 1141331 Dec 27 07:05 ./__spark_libs__/spark-streaming_2.12-3.1.2.jar 43059929 16 -r-x------ 1 yarn yarn 15155 Dec 27 07:05 ./__spark_libs__/spark-tags_2.12-3.1.2.jar 43059930 52 -r-x------ 1 yarn yarn 51457 Dec 27 07:05 ./__spark_libs__/spark-unsafe_2.12-3.1.2.jar 43059931 340 -r-x------ 1 yarn yarn 346697 Dec 27 07:05 ./__spark_libs__/spark-yarn_2.12-3.1.2.jar 43059932 440 -r-x------ 1 yarn yarn 447005 Dec 27 07:05 ./__spark_libs__/univocity-parsers-2.9.1.jar 43059933 80 -r-x------ 1 yarn yarn 79588 Dec 27 07:05 ./__spark_libs__/spire-macros_2.12-0.17.0-M1.jar 50357430 0 drwx------ 3 yarn yarn 158 Dec 27 07:05 ./__spark_conf__ 50357431 4 -r-x------ 1 yarn yarn 2371 Dec 27 07:05 ./__spark_conf__/log4j.properties 54536278 4 drwx------ 2 yarn yarn 4096 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__ 54536279 4 -r-x------ 1 yarn yarn 3130 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/mapred-site.xml 54536280 4 -r-x------ 1 yarn yarn 3635 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/hadoop-env.sh 54536281 16 -r-x------ 1 yarn yarn 14713 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/log4j.properties 54536282 4 -r-x------ 1 yarn yarn 3321 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/hadoop-metrics2.properties 54536283 4 -r-x------ 1 yarn yarn 3410 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/yarn-site.xml 54536284 4 -r-x------ 1 yarn yarn 2421 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/core-site.xml 54536285 4 -r-x------ 1 yarn yarn 10 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/workers 54536286 12 -r-x------ 1 yarn yarn 9213 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/capacity-scheduler.xml 54536287 4 -r-x------ 1 yarn yarn 1335 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/configuration.xsl 54536288 4 -r-x------ 1 yarn yarn 2359 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/hdfs-site.xml 54536289 8 -r-x------ 1 yarn yarn 6272 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/yarn-env.sh 54536290 4 -r-x------ 1 yarn yarn 95 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/container-executor.cfg 54536291 4 -r-x------ 1 yarn yarn 1764 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/mapred-env.sh 54536292 4 -r-x------ 1 yarn yarn 2316 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/ssl-client.xml.example 54536293 12 -r-x------ 1 yarn yarn 11392 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/hadoop-policy.xml 54536294 4 -r-x------ 1 yarn yarn 2697 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/ssl-server.xml.example 54536295 4 -r-x------ 1 yarn yarn 128 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/taskcontroller.cfg 54536296 8 -r-x------ 1 yarn yarn 4113 Dec 27 07:05 ./__spark_conf__/__hadoop_conf__/mapred-queues.xml.template 50357432 200 -r-x------ 1 yarn yarn 202750 Dec 27 07:05 ./__spark_conf__/__spark_hadoop_conf__.xml 50357433 4 -r-x------ 1 yarn yarn 561 Dec 27 07:05 ./__spark_conf__/__spark_conf__.properties 50357434 4 -r-x------ 1 yarn yarn 676 Dec 27 07:05 ./__spark_conf__/__spark_dist_cache__.properties broken symlinks(find -L . 
-maxdepth 5 -type l -ls): Log Type: launch_container.sh Log Upload Time: Tue Dec 27 07:05:29 +0000 2022 Log Length: 5217 Showing 4096 bytes of 5217 total. Click here for the full log. TTP_PORT="8042" export LOCAL_DIRS="/data/yarn/usercache/vagrant/appcache/application_1672123666896_0002" export LOCAL_USER_DIRS="/data/yarn/usercache/vagrant/" export LOG_DIRS="/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001" export USER="vagrant" export LOGNAME="vagrant" export HOME="/home/" export PWD="/data/yarn/usercache/vagrant/appcache/application_1672123666896_0002/container_1672123666896_0002_01_000001" export LOCALIZATION_COUNTERS="197404759,0,3,0,668" export JVM_PID="$$" export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=" export SPARK_YARN_STAGING_DIR="hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002" export APPLICATION_WEB_PROXY_BASE="/proxy/application_1672123666896_0002" export SPARK_DIST_CLASSPATH="/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/./:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*" export CLASSPATH="$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/./:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:$PWD/__spark_conf__/__hadoop_conf__" export APP_SUBMIT_TIME_ENV="1672124716628" export SPARK_USER="vagrant" export PYTHONHASHSEED="0" export MALLOC_ARENA_MAX="4" echo "Setting up job resources" ln -sf -- "/data/yarn/usercache/vagrant/filecache/15/spark-examples_2.12-3.1.2.jar" "__app__.jar" ln -sf -- "/data/yarn/usercache/vagrant/filecache/13/__spark_libs__7106463166155452428.zip" "__spark_libs__" ln -sf -- "/data/yarn/usercache/vagrant/filecache/14/__spark_conf__.zip" "__spark_conf__" echo "Copying debugging information" # Creating copy of launch script cp "launch_container.sh" "/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/launch_container.sh" chmod 640 "/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/launch_container.sh" # Determining directory contents echo "ls -l:" 1>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info" ls -l 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info" echo "find -L . -maxdepth 5 -ls:" 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info" find -L . -maxdepth 5 -ls 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info" echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info" find -L .
-maxdepth 5 -type l -ls 1>>"/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/directory.info" echo "Launching container" exec /bin/bash -c "LD_LIBRARY_PATH=\"/usr/lib/hadoop/lib/native:$LD_LIBRARY_PATH\" $JAVA_HOME/bin/java -server -Xmx1024m -Djava.io.tmpdir=$PWD/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar file:/usr/lib/spark/examples/jars/spark-examples_2.12-3.1.2.jar --arg '1000' --properties-file $PWD/__spark_conf__/__spark_conf__.properties --dist-cache-conf $PWD/__spark_conf__/__spark_dist_cache__.properties 1> /var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1672123666896_0002/container_1672123666896_0002_01_000001/stderr" Log Type: prelaunch.err Log Upload Time: Tue Dec 27 07:05:29 +0000 2022 Log Length: 0 Log Type: prelaunch.out Log Upload Time: Tue Dec 27 07:05:29 +0000 2022 Log Length: 100 Setting up env variables Setting up job resources Copying debugging information Launching container Log Type: stderr Log Upload Time: Tue Dec 27 07:05:29 +0000 2022 Log Length: 342839 Showing 4096 bytes of 342839 total. Click here for the full log. .0 (TID 996) (bds02.beartell.com, executor 5, partition 996, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map() 22/12/27 07:05:27 INFO TaskSetManager: Starting task 997.0 in stage 0.0 (TID 997) (bds03.beartell.com, executor 3, partition 997, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map() 22/12/27 07:05:27 INFO TaskSetManager: Finished task 990.0 in stage 0.0 (TID 990) in 7 ms on bds01.beartell.com (executor 1) (990/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 989.0 in stage 0.0 (TID 989) in 17 ms on bds02.beartell.com (executor 5) (991/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 992.0 in stage 0.0 (TID 992) in 7 ms on bds03.beartell.com (executor 3) (992/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 991.0 in stage 0.0 (TID 991) in 7 ms on bds05.beartell.com (executor 4) (993/1000) 22/12/27 07:05:27 INFO TaskSetManager: Starting task 998.0 in stage 0.0 (TID 998) (bds04.beartell.com, executor 2, partition 998, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map() 22/12/27 07:05:27 INFO TaskSetManager: Starting task 999.0 in stage 0.0 (TID 999) (bds01.beartell.com, executor 1, partition 999, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map() 22/12/27 07:05:27 INFO TaskSetManager: Finished task 994.0 in stage 0.0 (TID 994) in 8 ms on bds05.beartell.com (executor 4) (994/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 996.0 in stage 0.0 (TID 996) in 8 ms on bds02.beartell.com (executor 5) (995/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 997.0 in stage 0.0 (TID 997) in 7 ms on bds03.beartell.com (executor 3) (996/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 993.0 in stage 0.0 (TID 993) in 14 ms on bds04.beartell.com (executor 2) (997/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 995.0 in stage 0.0 (TID 995) in 8 ms on bds01.beartell.com (executor 1) (998/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 999.0 in stage 0.0 (TID 999) in 6 ms on bds01.beartell.com (executor 1) (999/1000) 22/12/27 07:05:27 INFO TaskSetManager: Finished task 998.0 in stage 0.0 (TID 998) in 10 ms on 
bds04.beartell.com (executor 2) (1000/1000) 22/12/27 07:05:27 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 22/12/27 07:05:27 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 3.993 s 22/12/27 07:05:27 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job 22/12/27 07:05:27 INFO YarnClusterScheduler: Killing all running tasks in stage 0: Stage finished 22/12/27 07:05:27 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 4.082941 s 22/12/27 07:05:27 INFO SparkUI: Stopped Spark web UI at http://bds02.beartell.com:46611 22/12/27 07:05:27 INFO YarnClusterSchedulerBackend: Shutting down all executors 22/12/27 07:05:27 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down 22/12/27 07:05:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 22/12/27 07:05:27 INFO MemoryStore: MemoryStore cleared 22/12/27 07:05:27 INFO BlockManager: BlockManager stopped 22/12/27 07:05:27 INFO BlockManagerMaster: BlockManagerMaster stopped 22/12/27 07:05:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 22/12/27 07:05:27 INFO SparkContext: Successfully stopped SparkContext 22/12/27 07:05:27 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 22/12/27 07:05:27 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 22/12/27 07:05:27 INFO AMRMClientImpl: Waiting for application to be successfully unregistered. 22/12/27 07:05:27 INFO ApplicationMaster: Deleting staging directory hdfs://bds01.beartell.com:8020/user/vagrant/.sparkStaging/application_1672123666896_0002 22/12/27 07:05:27 INFO ShutdownHookManager: Shutdown hook called 22/12/27 07:05:27 INFO ShutdownHookManager: Deleting directory /data/yarn/usercache/vagrant/appcache/application_1672123666896_0002/spark-2e058340-f7fd-4c93-ba63-d942ef271166 Log Type: stdout Log Upload Time: Tue Dec 27 07:05:29 +0000 2022 Log Length: 31 Pi is roughly 3.14183903141839
Besides spark-submit, you can use Apache Spark programmatically from your own Python application: obtain a Spark session, submit interactive computation workloads to the Poleposition cluster, and run machine learning workflows, as shown below:
from pyspark.sql import SparkSession
from pyspark.mllib.feature import Word2Vec

if __name__ == "__main__":
    # Build a Spark session that submits work to the Poleposition YARN cluster.
    spark = SparkSession\
        .builder\
        .config("spark.hadoop.yarn.resourcemanager.address", "bds01.beartell.com:8032")\
        .config("spark.yarn.stagingDir", "/tmp/")\
        .master("yarn")\
        .appName("Word2VecExample")\
        .getOrCreate()
    sc = spark.sparkContext

    # Read the sample corpus and split each line into words.
    inp = sc.textFile("sample_lda_data.txt").map(lambda row: row.split(" "))

    # Train a Word2Vec model and look up the 5 nearest words to the token '1'.
    word2vec = Word2Vec()
    model = word2vec.fit(inp)
    synonyms = model.findSynonyms('1', 5)

    for word, cosine_distance in synonyms:
        print("{}: {}".format(word, cosine_distance))

    sc.stop()
In this case, we launch the Spark application from a terminal and watch its output on the same terminal.
[vagrant@bds02 ~]$ python3 Word2vec.py Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 2022-12-27 08:23:05,585 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2022-12-27 08:23:06,673 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. /usr/local/lib/python3.6/site-packages/pyspark/context.py:238: FutureWarning: Python 3.6 support is deprecated in Spark 3.2. FutureWarning 2022-12-27 08:23:17,831 WARN netlib.InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS 2022-12-27 08:23:17,835 WARN netlib.InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS 4: 0.32949933409690857 0: 0.25188517570495605 3: 0.020412154495716095 2: 0.013244825415313244 9: -0.05387714132666588
This is the end of Part II. I hope you enjoy this article, and merry CHRISTMAS to everyone!