Goutte — Получить внутренние значения из $crawler-›filter()

Я использую PHP 7.1.33 и "fabpot/goutte": "^3.2". Мой файл композитора выглядит следующим образом:

{
    "name": "ubuntu/workspace",
    "require": {
        "fabpot/goutte": "^3.2"
    },
    "authors": [
        {
            "name": "admin",
            "email": "[email protected]"
        }
    ]
}

Я пытаюсь получить детали по временному диапазону с веб-страницы, но изо всех сил пытаюсь передать $crawler-значения в мой окончательный массив результатов $res1Array.

Я пробовал следующее:

<?php
require 'vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

/**
 * Crawls Detail Calender
 * Does NOT also include wanted Date in the final result set
 * @param $wantedDate
 * @return array
 */
function updateCalendarDetailsData($wantedDate)
{
    try {
        $client = new Client();

        /*
        $x = 1;
        $LIMIT = 3;
        global $x;
        global $LIMIT;
        $x++;
        */
        $res1Array = array();

        $ffUrlArr = ["https://www.forexfactory.com/calendar.php?month=Jan2020"];
        foreach ($ffUrlArr as $key => $v) {

            try {
                $crawler = $client->request('GET', $ffUrlArr[$key]);
            } catch (\Exception $ex) {
                error_log($ex);
            }

            $TEMP = array();

            // $count = $crawler->filter('.calendar_row')->count();
            // $i = 1; // count starts at 1
            $nodeDate = date('Y-m-d');
            $crawler->filter('.calendar_row')->each(function ($node) use (&$res1Array, $wantedDate, $nodeDate) { // $count, $i,
                $EVENT = array();

                // check date for month
                $dayMonth = str_split(explode(" ", trim($node->getNode(0)->nodeValue))[0], 3);
                $day = explode(" ", trim($node->getNode(0)->nodeValue))[1];
                if (is_numeric($day)) {
                    $nodeDate = date("Y-m-d H:i:s", strtotime($dayMonth[0] . " " . $dayMonth[1] . " " . $day));
                }

                // return if wanted date is reached
                if (date("Y-m-d", strtotime($nodeDate)) == date("Y-m-d", strtotime($wantedDate))) {
                    return $res1Array;
                }

                $EVENTID = $node->attr('data-eventid');

                $API_RESPONSE = file_get_contents('https://www.forexfactory.com/flex.php?do=ajax&contentType=Content&flex=calendar_mainCal&details=' . $EVENTID);

                $API_RESPONSE = str_replace("<![CDATA[", "", $API_RESPONSE);
                $API_RESPONSE = str_replace("]]>", "", $API_RESPONSE);

                $html = <<<HTML
<!DOCTYPE html>
<html>
    <body>
       $API_RESPONSE
    </body>
</html>
HTML;

                $subcrawler = new Crawler($html);

                $subcrawler->filter('.calendarspecs__spec')->each(function ($LEFT_TD) use (&$res1Array, &$TEMP, &$EVENT) {

                    $LEFT_TD_INNER_TEXT = trim($LEFT_TD->text());

                    if ($LEFT_TD_INNER_TEXT == "Source") {

                        $TEMP = array();
                        $LEFT_TD->nextAll()->filter('a')->each(function ($LINK) use (&$TEMP) {
                            array_push($TEMP, $LINK->text(), $LINK->attr('href'));
                        });

                        $EVENT['sourceTEXT'] = $TEMP[0];
                        $EVENT['sourceURL'] = $TEMP[1];
                        $EVENT['latestURL'] = $TEMP[3];
                    }

                    if ($LEFT_TD_INNER_TEXT == "Measures") {
                        $EVENT['measures'] = $LEFT_TD->nextAll()->text();
                    }

                    if ($LEFT_TD_INNER_TEXT == "Usual Effect") {
                        $EVENT['usual_effect'] = $LEFT_TD->nextAll()->text();
                    }

                    if ($LEFT_TD_INNER_TEXT == "Frequency") {
                        $EVENT['frequency'] = $LEFT_TD->nextAll()->text();
                    }

                    if ($LEFT_TD_INNER_TEXT == "Why Traders") {
                        $EVENT['why_traders_care'] = $LEFT_TD->nextAll()->text();
                    }

                    if ($LEFT_TD_INNER_TEXT == "Derived Via") {
                        $EVENT['derived_via'] = $LEFT_TD->nextAll()->text();
                        // array_push($res1Array, $EVENT); // <---- HERE I GET THE ERROR!
                    }
                });
                /*
                $i++;
                if ($i > $count) {
                    echo "<pre>";
                    var_dump($res1Array);
                    print_r($res1Array);
                    echo "</pre>";
                    exit;
                }
                */
            });
        }
    } catch (\Exception $ex) {
        error_log($ex);
    }
    return $res1Array;
}

var_dump(updateCalendarDetailsData(date("2020-01-02")));

Как вы можете видеть, я пытаюсь создать $EVENT и помещать все нужные значения в виде пар ключ-значение внутрь. Когда я закончу, я хочу подтолкнуть его к $resArray, получив следующую структуру (значения в этом array() предназначены только для структурных целей):

[
    sourceTEXT => "test", 
    sourceURL => "test",
    latestURL => "test", 
    measures => "test",
    usual_effect => "test",
    derived_via => "test",
    why_traders_care => "test",
    frequency => "test"
],
[
    sourceTEXT => "test1", 
    sourceURL => "test1",
    latestURL => "test1", 
    measures => "test1",
    usual_effect => "test1",
    derived_via => "test1",
    why_traders_care => "test1",
    frequency => "test1"
],
[
    sourceTEXT => "test2", 
    sourceURL => "test2",
    latestURL => "test2", 
    measures => "test2",
    usual_effect => "test2",
    derived_via => "test2",
    why_traders_care => "test2",
    frequency => "test2"
], 
// ... 

В настоящее время я ничего не получаю в моем $res1Array.

Я очень ценю ваши ответы!

ОБНОВЛЕНИЕ

Я запустил скрипт от @tftd с "fabpot/goutte": "^4.0", но получил вот это:

array(94) {
  [0] =>
  array(10) {
    'eventId' =>
    string(6) "114340"
    'date' =>
    string(10) "2020-01-01"
    'sourceTEXT' =>
    NULL
    'sourceURL' =>
    NULL
    'latestURL' =>
    NULL
    'measures' =>
    NULL
    'usual_effect' =>
    NULL
    'derived_via' =>
    NULL
    'why_traders_care' =>
    NULL
    'frequency' =>
    NULL
  }
  [1] =>
  array(10) {
    'eventId' =>
    string(6) "114341"
    'date' =>
    string(10) "2020-01-01"
    'sourceTEXT' =>
    NULL
    'sourceURL' =>
    NULL
    'latestURL' =>
    NULL
    'measures' =>
    NULL
    'usual_effect' =>
    NULL
    'derived_via' =>
    NULL
    'why_traders_care' =>
    NULL
    'frequency' =>
    NULL
  }
  [2] =>
  array(10) {
    'eventId' =>
    string(6) "114342"
    'date' =>
    string(10) "2020-01-01"
    'sourceTEXT' =>
    NULL
    'sourceURL' =>
    NULL
    'latestURL' =>
    NULL
    'measures' =>
    NULL
    'usual_effect' =>
    NULL
    'derived_via' =>
    NULL
    'why_traders_care' =>
    NULL
    'frequency' =>
    NULL
  }
  [3] =>
  array(10) {
    'eventId' =>
    string(6) "114343"
    'date' =>
    string(10) "2020-01-01"
    'sourceTEXT' =>
    NULL
    'sourceURL' =>
    NULL
    'latestURL' =>
    NULL
    'measures' =>
    NULL
    'usual_effect' =>
    NULL
    'derived_via' =>
    NULL
    'why_traders_care' =>
    NULL
    'frequency' =>
    NULL
  }
  [4] =>
  array(10) {
    'eventId' =>
    string(6) "114328"
    'date' =>
    string(10) "2020-01-01"
    'sourceTEXT' =>
    NULL
    'sourceURL' =>
    NULL
    'latestURL' =>
    NULL
    'measures' =>
    NULL
    'usual_effect' =>
    NULL
    'derived_via' =>
    NULL
    'why_traders_care' =>
    NULL
    'frequency' =>
    NULL
  }
  [5] =>
  array(10) {
    'eventId' =>
    string(6) "113632"
    'date' =>
    string(10) "2020-01-01"
    'sourceTEXT' =>
    NULL
    'sourceURL' =>
    NULL
    'latestURL' =>
    NULL
    'measures' =>
    NULL
    'usual_effect' =>
    NULL
    'derived_via' =>
    NULL
    'why_traders_care' =>
    NULL
    'frequency' =>
    NULL
  }
  [6] =>
  array(10) {
    'eventId' =>
    string(6) "114308"
    'date' =>
    string(10) "2020-01-01"
    'sourceTEXT' =>
    NULL
    'sourceURL' =>
    NULL
    'latestURL' =>
    NULL
    'measures' =>
    NULL
    'usual_effect' =>
    NULL
    'derived_via' =>
    NULL
    'why_traders_care' =>
    NULL
    'frequency' =>
    NULL
  }
// ...

Любые предложения, почему я получаю все эти нулевые значения?


person Carol.Kar    schedule 04.01.2020    source источник
comment
Насколько я понимаю, вы хотите проанализировать строки таблицы для определенной даты, т.е. 2020-01-02, в массив, который содержит данные строки. Это правильно?   -  person tftd    schedule 10.01.2020
comment
@tftd Я хочу проанализировать строки таблицы с сегодняшнего дня до определенной даты в будущем now() - 2020/01-18 (что не является частью приведенного выше примера, поскольку оно начинается с самого начала и продолжается до определенной даты, однако я мог бы просто пропустить ненужные строки ). Моя большая проблема в том, что я возвращаю пустой массив. Приведите полностью рабочий пример.   -  person Carol.Kar    schedule 10.01.2020
comment
@Anna.Klee, что касается вашего ОБНОВЛЕНИЯ на вопрос: вы уверены, что используемые вами ссылки являются реальными API, а не какими-то тестовыми конечными точками? Настоящие конечные точки API обычно требуют некоторых учетных данных. И не только многие поля из forexfactory.com/flex.php... пусты, но и то, что вы можете увидеть в браузере, когда вы посещаете forexfactory.com/calendar.php?month=Jan2020 отличается от того, что вы можете получить с file_get_contents() или $client->request(). Например. попробуйте найти идентификатор события 113606   -  person x00    schedule 15.01.2020


Ответы (3)


Я позволил себе немного переписать ваш код, используя ООП, вместо того, чтобы оставить его функциональным, потому что гораздо проще сосредоточиться на более мелких фрагментах кода. Должно быть легко преобразовать его в функциональное кодирование, если оно вам понадобится.

Этот класс принимает date, который отформатирован в Jan2020, чтобы иметь возможность получить календарь.

 $parser = new CalendarParser(date_create());

Чтобы получить события для диапазона дат в записях календаря, вам нужно вызвать $parser->getEventsBetweenDates() с startDate и endDate. Часы не учитываются при разборе, но вы можете добавить их, если вам это нужно. Вот пример:

$parser->getEventsBetweenDates(
   date_create_from_format('Y-m-d H:i:s', '2020-01-01 00:00:00'),
   date_create_from_format('Y-m-d H:i:s', '2020-01-02 23:59:59')
)

Результат приведенного выше кода:

<!-- language: lang-none -->

array(22) { 
  [0] => array(10) {
    'eventId' => string(6) "114340"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [1] => array(10) {
    'eventId' => string(6) "114341"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [2] => array(10) {
    'eventId' => string(6) "114342"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [3] => array(10) {
    'eventId' => string(6) "114343"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [4] => array(10) {
    'eventId' => string(6) "114328"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [5] => array(10) {
    'eventId' => string(6) "113632"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [6] => array(10) {
    'eventId' => string(6) "114308"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [7] => array(10) {
    'eventId' => string(6) "113607"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [8] => array(10) {
    'eventId' => string(6) "113816"
    'date' => string(10) "2020-01-01"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [9] => array(10) {
    'eventId' => string(6) "114718"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(25) "Reserve Bank of Australia"
    'sourceURL' => string(21) "http://www.rba.gov.au"
    'latestURL' => string(65) "http://www.rba.gov.au/statistics/frequency/commodity-prices/2019/"
    'measures' => string(52) "Change in the selling price of exported commodities;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(120) "The average selling price of the nation's main commodity exports are sampled and then compared to the previous sampling;"
    'why_traders_care' => string(128) "It's a leading indicator of the nation's trade balance with other countries because rising commodity prices boost export income;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [10] => array(10) {
    'eventId' => string(6) "114344"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
  [11] => array(10) {
    'eventId' => string(6) "111383"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(6) "Markit"
    'sourceURL' => string(30) "http://www.markiteconomics.com"
    'latestURL' => string(72) "https://www.markiteconomics.com/Public/Release/PressReleases?language=en"
    'measures' => string(95) "Level of a diffusion index based on surveyed purchasing managers in the manufacturing industry;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(204) "Survey of about 400 purchasing managers which asks respondents to rate the relative level of business conditions including employment, production, new orders, prices, supplier deliveries, and inventories;"
    'why_traders_care' => string(213) "It's a leading indicator of economic health - businesses react quickly to market conditions, and their purchasing managers hold perhaps the most current and relevant insight into the company's view of the economy;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [12] => array(10) {
    'eventId' => string(6) "111382"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(6) "Markit"
    'sourceURL' => string(30) "http://www.markiteconomics.com"
    'latestURL' => string(72) "https://www.markiteconomics.com/Public/Release/PressReleases?language=en"
    'measures' => string(95) "Level of a diffusion index based on surveyed purchasing managers in the manufacturing industry;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(204) "Survey of about 450 purchasing managers which asks respondents to rate the relative level of business conditions including employment, production, new orders, prices, supplier deliveries, and inventories;"
    'why_traders_care' => string(213) "It's a leading indicator of economic health - businesses react quickly to market conditions, and their purchasing managers hold perhaps the most current and relevant insight into the company's view of the economy;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [13] => array(10) {
    'eventId' => string(6) "111379"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(6) "Markit"
    'sourceURL' => string(30) "http://www.markiteconomics.com"
    'latestURL' => string(72) "https://www.markiteconomics.com/Public/Release/PressReleases?language=en"
    'measures' => string(95) "Level of a diffusion index based on surveyed purchasing managers in the manufacturing industry;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(204) "Survey of about 750 purchasing managers which asks respondents to rate the relative level of business conditions including employment, production, new orders, prices, supplier deliveries, and inventories;"
    'why_traders_care' => string(213) "It's a leading indicator of economic health - businesses react quickly to market conditions, and their purchasing managers hold perhaps the most current and relevant insight into the company's view of the economy;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [14] => array(10) {
    'eventId' => string(6) "111380"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(6) "Markit"
    'sourceURL' => string(30) "http://www.markiteconomics.com"
    'latestURL' => string(72) "https://www.markiteconomics.com/Public/Release/PressReleases?language=en"
    'measures' => string(95) "Level of a diffusion index based on surveyed purchasing managers in the manufacturing industry;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(204) "Survey of about 800 purchasing managers which asks respondents to rate the relative level of business conditions including employment, production, new orders, prices, supplier deliveries, and inventories;"
    'why_traders_care' => string(213) "It's a leading indicator of economic health - businesses react quickly to market conditions, and their purchasing managers hold perhaps the most current and relevant insight into the company's view of the economy;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [15] => array(10) {
    'eventId' => string(6) "111381"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(6) "Markit"
    'sourceURL' => string(30) "http://www.markiteconomics.com"
    'latestURL' => string(72) "https://www.markiteconomics.com/Public/Release/PressReleases?language=en"
    'measures' => string(95) "Level of a diffusion index based on surveyed purchasing managers in the manufacturing industry;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(205) "Survey of about 5000 purchasing managers which asks respondents to rate the relative level of business conditions including employment, production, new orders, prices, supplier deliveries, and inventories;"
    'why_traders_care' => string(213) "It's a leading indicator of economic health - businesses react quickly to market conditions, and their purchasing managers hold perhaps the most current and relevant insight into the company's view of the economy;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [16] => array(10) {
    'eventId' => string(6) "111397"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(6) "Markit"
    'sourceURL' => string(30) "http://www.markiteconomics.com"
    'latestURL' => string(72) "https://www.markiteconomics.com/Public/Release/PressReleases?language=en"
    'measures' => string(95) "Level of a diffusion index based on surveyed purchasing managers in the manufacturing industry;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(204) "Survey of about 650 purchasing managers which asks respondents to rate the relative level of business conditions including employment, production, new orders, prices, supplier deliveries, and inventories;"
    'why_traders_care' => string(213) "It's a leading indicator of economic health - businesses react quickly to market conditions, and their purchasing managers hold perhaps the most current and relevant insight into the company's view of the economy;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [17] => array(10) {
    'eventId' => string(6) "111102"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(34) "Challenger, Gray & Christmas, Inc."
    'sourceURL' => string(30) "http://www.challengergray.com/"
    'latestURL' => string(50) "http://www.challengergray.com/press/press-releases"
    'measures' => string(56) "Change in the number of job cuts announced by employers;"
    'usual_effect' => string(51) "'Actual' less than 'Forecast' is good for currency;"
    'derived_via' => NULL
    'why_traders_care' => NULL
    'frequency' => string(52) "Released monthly, about 3 days after the month ends;"
  }
  [18] => array(10) {
    'eventId' => string(6) "110766"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(19) "Department of Labor"
    'sourceURL' => string(18) "http://www.dol.gov"
    'latestURL' => string(20) "https://www.dol.gov/"
    'measures' => string(103) "The number of individuals who filed for unemployment insurance for the first time during the past week;"
    'usual_effect' => string(51) "'Actual' less than 'Forecast' is good for currency;"
    'derived_via' => NULL
    'why_traders_care' => string(306) "Although it's generally viewed as a lagging indicator, the number of unemployed people is an important signal of overall economic health because consumer spending is highly correlated with labor-market conditions. Unemployment is also a major consideration for those steering the country's monetary policy;"
    'frequency' => string(44) "Released weekly, 5 days after the week ends;"
  }
  [19] => array(10) {
    'eventId' => string(6) "113642"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(6) "Markit"
    'sourceURL' => string(30) "http://www.markiteconomics.com"
    'latestURL' => string(72) "https://www.markiteconomics.com/Public/Release/PressReleases?language=en"
    'measures' => string(95) "Level of a diffusion index based on surveyed purchasing managers in the manufacturing industry;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(204) "Survey of about 400 purchasing managers which asks respondents to rate the relative level of business conditions including employment, production, new orders, prices, supplier deliveries, and inventories;"
    'why_traders_care' => string(213) "It's a leading indicator of economic health - businesses react quickly to market conditions, and their purchasing managers hold perhaps the most current and relevant insight into the company's view of the economy;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [20] => array(10) {
    'eventId' => string(6) "111392"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => string(6) "Markit"
    'sourceURL' => string(30) "http://www.markiteconomics.com"
    'latestURL' => string(72) "https://www.markiteconomics.com/Public/Release/PressReleases?language=en"
    'measures' => string(95) "Level of a diffusion index based on surveyed purchasing managers in the manufacturing industry;"
    'usual_effect' => string(54) "'Actual' greater than 'Forecast' is good for currency;"
    'derived_via' => string(204) "Survey of about 800 purchasing managers which asks respondents to rate the relative level of business conditions including employment, production, new orders, prices, supplier deliveries, and inventories;"
    'why_traders_care' => string(213) "It's a leading indicator of economic health - businesses react quickly to market conditions, and their purchasing managers hold perhaps the most current and relevant insight into the company's view of the economy;"
    'frequency' => string(65) "Released monthly, on the first business day after the month ends;"
  }
  [21] => array(10) {
    'eventId' => string(6) "113817"
    'date' => string(10) "2020-01-02"
    'sourceTEXT' => NULL
    'sourceURL' => NULL
    'latestURL' => NULL
    'measures' => NULL
    'usual_effect' => NULL
    'derived_via' => NULL
    'why_traders_care' => string(230) "Banks facilitate the majority of foreign exchange volume. When they are closed the market is less liquid and speculators become a more dominant market influence. This can lead to both abnormally low and abnormally high volatility;"
    'frequency' => NULL
  }
}

Вот полный код:

<?php

require 'vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

/**
 * Thinking OOP is easier for me.
 * You can easily restructure this into a `functional` code if that's what you need.
 */
class CalendarParser
{

    const BASE_URL = 'https://www.forexfactory.com/calendar.php?month=%s';
    const EVENT_URL = 'https://www.forexfactory.com/flex.php?do=ajax&contentType=Content&flex=calendar_mainCal&details=%d';

    /**
     * @var
     */
    private $client;

    /**
     * @var DateTime
     */
    private $calendarMonth;

    /**
     * @var Crawler
     */
    private $page;

    /**
     * @var Crawler
     */
    private $table;

    /**
     * @var array
     */
    private $dateIndexes;

    /**
     * CalendarParser constructor.
     *
     * @param DateTime $calendarMonth
     * @throws Exception
     */
    public function __construct(DateTime $calendarMonth)
    {
        $this->client = new Client();
        $this->calendarMonth = $calendarMonth;

        // Fetch page and table data and store it so we can iterate over it.
        $this->page = $this->client->request('GET', sprintf(self::BASE_URL, $this->calendarMonth->format('MY')));
        $this->table = $this->page->filter('.calendar_row');

        // Get date indexes
        $this->generateDateIndexes();
    }

    /**
     * The table uses a class called `newday` at each new date which can be used to create an index of
     * where the date records begin which makes parsing easier.
     */
    private function generateDateIndexes()
    {
        $dateIndexes = [];

        $previousDate = null;
        $this->table
            /**
             * NOTE: This is a closure function which will be called until the foreach completes.
             *       You cannot break out of it like when you do `foreach() { break; }`.
             *       If you do `return` - it will simply skip executing the rest of the function but won't break the cycle.
             */
            ->each(function (Crawler $node, $index) use (&$dateIndexes, &$previousDate) {
                $isNewDateSeparator = strpos($node->getNode(0)->getAttribute('class'), 'newday') !== false;

                if ($isNewDateSeparator) {
                    // Convert the date to `Jan-1-STARTING_YEAR` to be easier to search in the array.
                    $dateColumnNode = $node->filter('.date > span > span');
                    $stringDate = str_replace(' ', '-', $dateColumnNode->text()) . '-' . $this->calendarMonth->format('Y');
                    $date = date_create_from_format('M-d-Y', $stringDate);
                    $formattedDate = $date->format('Y-m-d');

                    $dateIndexes[$formattedDate] = [
                        'start' => $index,
                        'end'   => null
                    ];

                    if ($previousDate) {
                        $dateIndexes[$previousDate]['end'] = ($index - 1);
                    }

                    $previousDate = $formattedDate;
                }
            });

        $this->dateIndexes = $dateIndexes;
    }

    /**
     * @param Crawler $row
     * @return array
     */
    private function processEvent(DateTime $date, Crawler $row)
    {
        $eventId = $row->attr('data-eventid');

        $event = [
            'eventId'          => $eventId,
            'date'             => $date->format('Y-m-d'),
            'sourceTEXT'       => null,
            'sourceURL'        => null,
            'latestURL'        => null,
            'measures'         => null,
            'usual_effect'     => null,
            'derived_via'      => null,
            'why_traders_care' => null,
            'frequency'        => null
        ];

        $content = $this->client->request('GET', sprintf(self::EVENT_URL, $eventId))->html();
        $crawler = new Crawler($content, null, null);

        $table = $crawler->filter('.calendarspecs__spec')->first()->closest('table');

        $table->filter('tr')
              ->each(function (Crawler $tr) use (&$event) {
                  $label = $tr->filter('.calendarspecs__spec')->text();

                  $description = $tr->filter('.calendarspecs__specdescription');

                  if ($label === 'Source') {
                      $TEMP = [];
                      $description->filter(' a')
                                  ->each(function ($link) use (&$TEMP) {
                                      array_push($TEMP, $link->text(), $link->attr('href'));
                                  });

                      $event['sourceTEXT'] = $TEMP[0];
                      $event['sourceURL'] = $TEMP[1];
                      $event['latestURL'] = $TEMP[3];
                  }

                  if ($label == "Measures") {
                      $event['measures'] = $description->text();
                  }

                  if ($label == "Usual Effect") {
                      $event['usual_effect'] = $description->text();
                  }

                  if ($label == "Frequency") {
                      $event['frequency'] = $description->text();
                  }

                  // this is how it's returned.
                  if ($label == "Why TradersCare") {
                      $event['why_traders_care'] = $description->text();
                  }

                  if ($label == "Derived Via") {
                      $event['derived_via'] = $description->text();
                  }

              });

        return $event;
    }

    /**
     * Get the events between a start and end date.
     * If no endDate is defined - then it will get all events since $startDate.
     *
     * @param DateTime $startDate
     * @param DateTime|null $endDate
     *
     * @return array
     */
    public function getEventsBetweenDates(DateTime $startDate, DateTime $endDate = null)
    {
        $events = [];

        $totalCalendarRows = $this->table->count();
        foreach ($this->dateIndexes as $stringDate => $range) {
            $date = date_create_from_format('Y-m-d', $stringDate);

            // Process only the range from the start date
            if ($date >= $startDate) {
                // and break early when we reach the end.
                if ($endDate && $date > $endDate) {
                    break;
                }

                // collect and process events for the current date
                $start = $range['start'];
                $end = $range['end'] !== null ? $range['end'] : $totalCalendarRows;
                for ($i = $start; $i < $end; $i++) {
                    $events[] = $this->processEvent($date, new Crawler($this->table->getNode($i)));
                }
            }
        }

        return $events;
    }

}

$parser = new CalendarParser(date_create());

var_dump(
    $parser->getEventsBetweenDates(
        date_create_from_format('Y-m-d H:i:s', '2020-01-01 00:00:00'),
        date_create_from_format('Y-m-d H:i:s', '2020-01-02 23:59:59')
    )
);
person tftd    schedule 10.01.2020
comment
Эй, я не минусовал твой ответ! Я думаю, это топ! Однако при запуске класса я получаю Uncaught Error: Call to undefined method Symfony\Component\DomCrawler\Crawler::closest(). Какие версии библиотек вы используете? (composer.json) Пожалуйста, добавьте исправление, и я приму ваш ответ! - person Carol.Kar; 11.01.2020
comment
Я использую php 7.1 - person Carol.Kar; 11.01.2020
comment
Мой комментарий был для тех, кто проголосовал против, а не конкретно для вас. ИМХО, глупо минусовать, не указывая, что вы считаете неправильным. В настоящее время я использую php 7.3, но это не имеет значения. Я думаю, у вас может быть более старая версия fabpot/goutte — моя v4.0.0. РЕДАКТИРОВАТЬ: я только что заметил, что в вашем вопросе вы указали, что используете ^3.2. Можно ли обновиться до 4.0? - person tftd; 12.01.2020
comment
Полностью согласен! Я обновил свою версию до "fabpot/goutte": "^4.0", однако при запуске вашей class я получаю в результате значения только null. Пожалуйста, смотрите мое обновление выше. - person Carol.Kar; 13.01.2020
comment
Я заметил небольшую опечатку в своем классе (проверьте ревизию или просто скопируйте/вставьте). Если вы выполните поиск события 111392, вы увидите, что все поля заполнены. Значение null в других записях просто потому, что нет метки, из которой можно получить данные. - person tftd; 13.01.2020
comment
Последний вопрос: есть ли альтернатива использованию в этой строке $table = $crawler->filter('.calendarspecs__spec')->first()->closest('table'); функции closest(), поскольку для этого требуется v4.4. Был бы признателен за ваш обновленный ответ! - person Carol.Kar; 14.01.2020
comment
Для пакета fabpot/[email protected] требуется symfony/dom-crawler (именно здесь Crawler) с версиями либо ^4.4 или ^5.0. Эта функция есть в обоих релизах (см. ссылки). Я подозреваю, что с вашей стороны может быть что-то не так - я пробовал обе версии, и это работает. Может быть, проверить с помощью composer show -i, что на самом деле установлено? - person tftd; 15.01.2020
comment
Спасибо за ваш ответ! Я все еще получаю сообщение об ошибке в этой строке, хотя событие composer show -i показывает, что я установил fabpot/goutte v4.0.0 A simple PHP Web Scraper . Есть ли альтернатива методу closest()? - person Carol.Kar; 15.01.2020
comment
То, что вы говорите, не имеет смысла. Если у вас был fabpot/goutte v4.0.0, он должен был установить symfony/dom-crawler 4.4 or 5.0, если только нет другой зависимости, для которой требуется еще более низкая версия symfony/dom-crawler. Если это так, я не могу предоставить вам надежную альтернативу вовремя, потому что я не могу знать, какая версия dom-crawler у вас установлена. Не могли бы вы опубликовать весь copmser show -i в pastebin? - person tftd; 15.01.2020
comment
Спасибо за награду! Если вы вставите вывод composer, я обновлю свой ответ завтра, используя точные версии, которые у вас есть :) - person tftd; 15.01.2020

Я все еще работаю с кодом, который вы предоставили, но одна из первых вещей, которые я замечаю, прямо перед тем, как вы устанавливаете $API_RESPONSE, у вас есть следующие строки кода...

// return if wanted date is reached
if (date("Y-m-d", strtotime($nodeDate)) == date("Y-m-d", strtotime($wantedDate))) {
  return $res1Array;
}

В этот момент в функции вы еще не отправили какие-либо данные в $res1Array, поэтому она вернет только пустой массив. Только в $subcrawler (и второй попытке вернуть $res1Array) вы фактически вставляете информацию в массив.

Примечание. Я обновлю свой ответ, как только проработаю остальную часть кода, в надежде предоставить вам более полное решение вашей проблемы.

person morsecodemedia    schedule 09.01.2020
comment
Кстати, это функция закрытия. На самом деле он ничего не вернет и не будет break цикла foreach. Это только пропустит выполнение остального кода функции. - person tftd; 09.01.2020
comment
Спасибо за ваш ответ! Пожалуйста, включите полностью рабочий пример. - person Carol.Kar; 09.01.2020

Я рекомендую вам придерживаться вашего кода. Он меньше, проще и привычнее для вас.

Я сделал обзор вашего кода. Вы можете найти мои комментарии, помеченные "***".
Также вы можете сохранить этот код и сравнить его с исходной версией в каком-нибудь инструменте сравнения.

На самом деле у вас было всего 4 маленьких бага.

<?php
require 'vendor/autoload.php';

// use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

/**
 * Crawls Detail Calender
 * Does NOT also include wanted Date in the final result set
 * @param $wantedDate
 * @return array
 */
function updateCalendarDetailsData($wantedDate)
{
    // *** small optimizations
    $Year = $wantedDate->format("Y");
    $wantedDateStr = $wantedDate->format("Y M j");

    try {
        // $client = new Client(); // *** I don't see any need in this package

        $res1Array = array();

        $ffUrlArr = ["https://www.forexfactory.com/calendar.php?month=Jan2020"];
        foreach ($ffUrlArr as $key => $v) {
        // *** There one link in ffUrlArr, it's better to get rid off foreach().
        // *** But for now - let it be

            try {
                $crawler = new Crawler(file_get_contents($ffUrlArr[$key]));
                // $crawler = $client->request('GET', $ffUrlArr[$key]);
                // *** It's the only place where Goutte was used
            } catch (\Exception $ex) {
                error_log($ex);
            }

            // $TEMP = array();
            // *** No need to define it here, it's used only inside $subcrawler,
            // *** And it's redefined there

            // $nodeDate = date('Y-m-d');
            // *** no need for date('Y-m-d')
            $nodeDate = "";
            // $crawler->filter('.calendar_row')->each(function ($node) use (&$res1Array, $wantedDate, $nodeDate) {
            // *** BUG 1: here your forgot to put "&" before $nodeDate

            // *** Also, because you need to return on $wantedDate,
            // *** but you can not break from the each()
            // *** it is better to use foreach(), and in my opinion it
            // *** looks simpler. And it is less error prone,
            // *** as we can see.

            // *** By using '[data-eventid][data-touchable]' instead
            // *** of '.calendar_row' we can get rid of multiple requests
            // *** to forexfactory API with same $EVENTID
            foreach($crawler->filter('[data-eventid][data-touchable]') as $DOM_el) {
                $node = new Crawler($DOM_el);

                // $EVENT = array();
                // *** it's almost always better to define variable
                // *** near the place they are used. Moved it

                // check date for month
                // $dayMonth = str_split(explode(" ", trim($node->getNode(0)->nodeValue))[0], 3);
                // $day = explode(" ", trim($node->getNode(0)->nodeValue))[1];
                // if (is_numeric($day)) {
                //     $nodeDate = date("Y-m-d H:i:s", strtotime($dayMonth[0] . " " . $dayMonth[1] . " " . $day));
                // }
                // *** This is a cleaner and a simpler way to retrive
                // *** a date from this html. Getting nodeDate in the
                // *** form of "Y M j" (e.g. "2020 Jan 1")
                $date_node = $node->filter('.date > span > span');
                if( $date_node->count() != 0 ) {
                    $nodeDate = $Year . " " . $date_node->text();
                }

                // return if wanted date is reached
                // if (date("Y-m-d", strtotime($nodeDate)) == date("Y-m-d", strtotime($wantedDate))) {
                // *** There is no need for so many convertions.
                // *** Strings' comparison is good enough

                // *** BUG 2: Not critical, but "havy".
                // *** Because you can not break from ->each()
                // *** checking dates with "==" led to skiping only
                // *** $wantedDate, all dates after $wantedDate
                // *** were still iterated over
                if ($nodeDate == $wantedDateStr) {
                    // return $res1Array;
                    // *** Now, when we use foreach() instead of
                    // *** ->each() we can return from here.
                    // *** But still, I think it's better to use break.
                    // *** In case you would like to add some extra logic
                    // *** at the end, and for other vague reasons :)
                    break;
                }

                $EVENTID = $node->attr('data-eventid');

                $API_RESPONSE = file_get_contents('https://www.forexfactory.com/flex.php?do=ajax&contentType=Content&flex=calendar_mainCal&details=' . $EVENTID);

                $API_RESPONSE = str_replace("<![CDATA[", "", $API_RESPONSE);
                $API_RESPONSE = str_replace("]]>", "", $API_RESPONSE);

                $html = <<<HTML
<!DOCTYPE html>
<html>
    <body>
       $API_RESPONSE
    </body>
</html>
HTML;

                $subcrawler = new Crawler($html);

                // *** Took this part from tftd's answer
                // *** It's a good practice to define all possible fields
                $EVENT = [
                    'id'               => $EVENTID,
                    'date'             => $nodeDate,
                    'sourceTEXT'       => null,
                    'sourceURL'        => null,
                    'latestURL'        => null,
                    'measures'         => null,
                    'usual_effect'     => null,
                    'derived_via'      => null,
                    'why_traders_care' => null,
                    'frequency'        => null
                ];
                // $EVENT = array(); // *** But you can always switch back for this simple definition
                // $subcrawler->filter('.calendarspecs__spec')->each(function ($LEFT_TD) use (&$res1Array, &$TEMP, &$EVENT) {
                // *** once again switching from ->each() to foreach(),
                // *** just for the consistency
                foreach($subcrawler->filter('.calendarspecs__spec') as $DOM_el) {
                    $LEFT_TD = new Crawler($DOM_el);

                    $LEFT_TD_INNER_TEXT = trim($LEFT_TD->text());

                    if ($LEFT_TD_INNER_TEXT == "Source") {

                        $TEMP = array();
                        $LEFT_TD->nextAll()->filter('a')->each(function ($LINK) use (&$TEMP) {
                            array_push($TEMP, $LINK->text(), $LINK->attr('href'));
                        });

                        $EVENT['sourceTEXT'] = $TEMP[0];
                        $EVENT['sourceURL'] = $TEMP[1];
                        $EVENT['latestURL'] = $TEMP[3];
                    }

                    if ($LEFT_TD_INNER_TEXT == "Measures") {
                        $EVENT['measures'] = $LEFT_TD->nextAll()->text();
                    }

                    if ($LEFT_TD_INNER_TEXT == "Usual Effect") {
                        $EVENT['usual_effect'] = $LEFT_TD->nextAll()->text();
                    }

                    if ($LEFT_TD_INNER_TEXT == "Frequency") {
                        $EVENT['frequency'] = $LEFT_TD->nextAll()->text();
                    }

                    if ($LEFT_TD_INNER_TEXT == "Why TradersCare") {
                        // *** BUG 3: As tftd noticed - you had an issue
                        // *** with name of this field
                        $EVENT['why_traders_care'] = $LEFT_TD->nextAll()->text();
                    }

                    if ($LEFT_TD_INNER_TEXT == "Derived Via") {
                        $EVENT['derived_via'] = $LEFT_TD->nextAll()->text();
                        // array_push($res1Array, $EVENT); // <---- HERE I GET THE ERROR!
                        // *** BUG 4: And this was the main complication
                        // *** 1) Being here array_push() wasn't called if event
                        // ***    had no "Derived Via" field
                        // *** 2) but even more than that... it was somehow put
                        // ***    in the comments... and of course this led to
                        // ***    $res1Array never been populated
                    }
                }
                array_push($res1Array, $EVENT);
                // *** this command should be here
            }
        }
    } catch (Exception $ex) {
        error_log($ex);
    }
    return $res1Array;
}
// *** You'd better use DateTime, so its fields could be manipulated
// *** and retrieved more easily than in the case of a string representation
// var_dump(updateCalendarDetailsData(date("2020-01-02")));
var_dump(updateCalendarDetailsData(new DateTime("2020-01-02")));

?> 
person x00    schedule 12.01.2020