Web Scraper written in C++

What

My Python + Selenium Web-Scraper requires Chrome, ChromeDriver, Python, xvfb (X11 virtual frame buffer) and a bunch of hacks for hidden headless mode. I want something small, efficient, fast and low dependency that I can run on as Raspberry Pi or on AWS Lambda (with binary size limit). For this I need something that can render a webpage and also execute the Java-Script embedded in it. I also need to to have the ability to inject Java-Script code, so I can click away cookie-banners, fill out HTML forms and click submit automatically.

Most browsers use WebKit underneath. Apple’s Safari and also Chrome used to use it. In order to use WebKit we need a graphical user interface toolkit. Qt is relatively large. So I’ll go with wxWidgets.

Contents

When

Many of my hobbyist projects rely on scraping the web for data. Be it real estate listings, stock market prices or news aggregation.

It used to be possible to just connect a TCP socket to port 80, manually craft a GET request and send it off. Nowadays we need SSL encryption and key exchange schemes. Since Web 2.0 we often need to execute JavaScript code for the pages to even display the content we’re after. Then there are advertisements and coockie accept banners that need to be automatically clicked.

For this we need to run full browsers. In the past I’ve orchestrated Firefox or Chrome using the Python Selenium libraries. A more efficient method is to run smaller embedded browsers from C++ code and get much of the same features by injecting JavaScript, as I’ll show in the following.

Background

Most web scrapers use Python and Selenium. This is find and a realively simple approach for the developer. It comes with a performance hit and requires a lot of dependencies. This also makes it unstable, when underlying software gets updated.

Old Approach: Python + Selenium

When scrapping data with Python this is what I used to do:

# -- working headless mode by running in virtual frame buffer Xvfb
display = Display(visible=0, size=(1920, 1080))
display.start()

# -- some covert options to hide selenium
options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled") 
options.add_experimental_option("excludeSwitches", ["enable-automation"]) 
options.add_experimental_option("useAutomationExtension", False) 

# -- add UBlock Origin Extension to remove full-page ad-bloat that hinders 
extraBrowserExtension = "<path to ublock origin extension for this spectic browser (*.crx file)>"
options.add_extension(extraBrowserExtension)

# -- load Chrome
driver = webdriver.Chrome(options=options)

# -- some more covert options
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")

useragent = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"]
driver.execute_cdp_cmd(
    "Network.setUserAgentOverride", {"userAgent": useragent}
)

stealth(driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    })
    """
})

Then load a page

# -- open url
driver.get(grab_url)

Using my helper functions to get the page body and HTML elements, when they are finally loaded:

def webdriver_find_element(driver, root, cssSelector):
  for i in range(5):
    try:
      element = root.find_element(By.CSS_SELECTOR, cssSelector);
      return element
    except NoSuchElementException:
      time.sleep(0.5)
  print("timeout waiting for cookie accept")


def webdriver_get_html_element_body(driver):
  return WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.TAG_NAME, "body")))

Get the page, get subelements by CSS selectors, inject javascript to change attributes and click on elements.

# -- get html body
html_element_body = webdriver_get_html_element_body(driver)

# - get inner div
html_element_table = webdriver_find_element(driver, html_element_body, "main[class='page-content']")

(..)

# -- inject javascript
driver.execute_script("arguments[0].setAttribute('value', '" + date_start +"')", html_element_dtDate1);

(..)

# -- click submit
html_element_submit.click()

and finally extract the results

# -- extract information
print(html_element.get_attribute('outerHTML'))

That runs fine, but is slow and requires a lot of dependencies to be installed.

How

Choice of Tools

So in order to stream-line that in C++ we need to first think about what C++ libraries we can use to achieve our goals.

Browser Engines

Most browsers nowadays use some form of WebKit underneath. Safari uses WebKit. The Chrome Engine is also based on WebKit. Firefox has transitioned to Chrome, thereby also settling for WebKit.

It’s far too challenging to write a custom rendering engine and to keep it compatible with all the features the modern web contains. As mentioned above encryption and JavaScript execution is also required. So a good choice is to use WebKit.

Graphics User Interface library

Even though the code is intended to run automatically and headless (hidden window), we need some framework to run WebKit in. My first impuls was to use Qt, my second was GTK, but then I remembered wxWidgets. It’s an older, smaller graphical user interface. It can be built statically. It has a WebKit integration called wxWebView.

Implementation

With the choice of tools out of the way we can start building our application in C++ with wxWidgets and Webkit.

wxWidgets Example Window

First let’s get an example [1] off the ground.

I’ll copy the example code into a single source code file, main.cpp, for now. Later I’ll transfer most of it to a functional/procedural-style and get rid of the classes. That will make the code way more portable and useful for our web scraping purposes.

#include <wx/wx.h>

class Simple : public wxFrame {
public:
    Simple(const wxString& title);
};

Simple::Simple(const wxString& title)
       : wxFrame(NULL, wxID_ANY, title, wxDefaultPosition, wxSize(250, 150)) {
  Centre();
}

class MyApp : public wxApp
{
  public:
    virtual bool OnInit();
};

IMPLEMENT_APP(MyApp)

bool MyApp::OnInit()
{
    Simple *simple = new Simple(wxT("Simple"));
    simple->Show(true);

    return true;
}

Let’s see what the compile says, if we try to build it without specifying the depedencies.

g++ main.cpp -o main

Similar to pkg-config, there is a wx-config tool that returns the library paths.

We can add --static=yes, if we have a wxWidgets version that is statically built. This is something to remember for later.

The version installed from the Ubuntu repositories is

wx-config --version  
3.2.4

For now this will do:

g++ `wx-config --libs` `wx-config --cxxflags` main.cpp -o main

That yields a host of undefined references. Of coursse the packages in the Ubuntu repositories are once again somehow built incorrectly

undefined reference to `wxString::FromAscii(char const*)'

Compile wxWidgets ourselves

Luckily we can easiliy compile wxWidgets ourselves. This will also be useful for a static compile later on.

wget https://github.com/wxWidgets/wxWidgets/releases/download/v3.3.1/wxWidgets-3.3.1.tar.bz2
tar -xvf wxWidgets-3.3.1.tar.bz2
echo "wxWidgets-3.3.1.tar.bz2" >> .gitignore
echo "wxWidgets-3.3.1/" >> .gitignore
echo "main.o" >> .gitignore
echo "main" >> .gitignore

cd wxWidgets-3.3.1/
mkdir _build/ && cd _build/
../configure  --with-gtk=3 --with-opengl

Configured wxWidgets 3.3.1 for `x86_64-pc-linux-gnu'

  Which GUI toolkit should wxWidgets use?                 GTK+ 3 with support for GTK+ printing libnotify
  Should wxWidgets be compiled into single library?       no
  Should wxWidgets be linked as a shared library?         yes
  Unicode encoding used by wxString?                      UTF-32
  What level of wxWidgets compatibility should be enabled?
                                       wxWidgets 3.0      no
                                       wxWidgets 3.2      yes
  Which libraries should wxWidgets use?
                                       jpeg               sys
                                       png                sys
                                       regex              sys
                                       tiff               sys
                                       webp               sys
                                       lzma               no
                                       zlib               sys
                                       expat              sys
                                       libmspack          no
                                       sdl                no
                                       webview            yes (with backends: WebKit)

Make sure webview is enabled.

make -j3

Build the sample and run it

cd wxWidgets-3.3.1/_build/samples/minimal
make
./minimal

Compile Flags

By building with

make -n

Creating a Makefile for our code with a copy of the same settings, replacing minimal with main,. removing some of the excess and introducing a variable for the path to wxWidgets yields an extremly long command-line.

It’s much easier to use the wx-config tool:

all:
	g++ -std=c++11 -Wall -o main main.cpp `./wxWidgets-3.3.1/_build/wx-config \
	--debug=yes --libs std,aui,stc --unicode=yes --cxxflags --libs`

We can now run our sample code the same

make
./main

Procedural Style

As mentioned above, I’ll keep this procedural.

However there is a wxWidgets macro used that expects a class name to be given that runs some of the initialisation, e.g. gdk_init().

IMPLEMENT_APP(MyApp)

Luckily there are ways to initialize wxWidgets without that macro. This code is functionally identical to the object-oriented version. 7 lines instead of 17 to spawn the window.

#include <wx/wx.h>

int main(int argc, char** argv) {
  wxApp::SetInstance(new wxApp());
  wxEntryStart(argc, argv);
  wxTheApp->CallOnInit();

  wxFrame * frame = new wxFrame(NULL, wxID_ANY, "Simple", wxDefaultPosition, wxSize(250, 150));
  frame->Show(true);

  wxTheApp->OnRun();
  return 0;
}

WebKit

First we add the webkit and html libraries for wxWidgets.

all:
	g++ -std=c++11 -Wall -o main main.cpp `./wxWidgets-3.3.1/_build/wx-config \
	--debug=yes --libs std,aui,stc,webview,html --unicode=yes --cxxflags --libs`

and change the code to this:

#include <wx/wx.h>
#include <wx/webview.h>

int main(int argc, char** argv) {
  wxApp::SetInstance(new wxApp());
  wxEntryStart(argc, argv);
  wxTheApp->CallOnInit();

  wxFrame * frame = new wxFrame(NULL, wxID_ANY, "Simple", wxDefaultPosition, wxSize(800, 1000));
  frame->Show(true);

  wxPanel* panel = new wxPanel(frame);
  wxBoxSizer* sizer = new wxBoxSizer(wxVERTICAL);

  wxWebView * webView = wxWebView::New(panel, wxID_ANY, "https://www.google.com");
  sizer->Add(webView, 1, wxEXPAND | wxALL, 0);

  panel->SetSizer(sizer);

  wxTheApp->OnRun();
  return 0;
}

For wxWebView to run on NVidia drivers we need to add an environment variable [4]:

make && WEBKIT_DISABLE_DMABUF_RENDERER=1 ./main

and with that WebKit runs, can open a URL and displays one of the cookie popups.

Injecting JavaScript

In order to click on cookie banner accept buttons, fill out HTML forms and click on submit we need to inject JavaScript code.

This is easy to achieve using the RunScript function of wxWebView.

webView->RunScript("javascript:(function() { alert('hi');})()");

Getting Page Source

We can get the page html source with

#include <iostream>
(..)
std::cout << webView->GetPageSource().utf8_str() << std::endl;

The problem we encounter now is that the page may not have been loaded yet.

Threads and Blocking Calls

So we need to delay getting the page source. This requires some thought, because the current thread is busy rendering the web page.

Loops

We can’t just add a while(true) loop as that would block wxWebView from loading the page.

while(true) {
  std::cout << "[ ] source: " << webView->GetPageSource().utf8_str() << std::endl;
  std::this_thread::sleep_for(std::chrono::seconds(5));
}

C++ Timer

We might consider making the webView pointer global and spawning a thread as a simple timer.

std::thread timeoutThread([&]() {
  while(true) {
    	std::this_thread::sleep_for(std::chrono::milliseconds(1500));
    	std::cout << "." << std::endl;
    	std::cout << webView->GetPageSource() << std::endl;
  }
});  

This however locks the thread after the first call to GetPageSource, likely because we are calling a wxWidget widget asynchronously bypassing the event system and run into a lock within the framework.

wxButton

As a quick test we can try wiring a Button:

void OnButton(wxCommandEvent& event) {
  std::cout << "button" << std::endl;  
  std::cout << webView->GetPageSource() << std::endl;
}


int main(int argc, char** argv) {
  (...)
  wxButton * button = new wxButton(panel, wxID_ANY, "Get Source");
  sizer->Add(button, 0, wxALL, 0);
  button->Bind(wxEVT_BUTTON, &OnButton);
  (...)
}

That works great (see html source in the console in the background). But we’d have to manually click a button each for each iteration.

wxTimer

The solution for non-blocking retrieval of the page source with a timer is to use the timer built into wxWidgets. After figuring out the calling sequence and syntax I ended up with this:

wxWebView * webView;

void OnTimer(wxTimerEvent& event) {
  std::cout << webView->GetPageSource() << std::endl;
}

int main(int argc, char** argv) {
  (...)
  const int TIMER_INTERVAL = 500;
  wxTimer * timer = new wxTimer(panel, wxID_ANY);
  panel->Bind(wxEVT_TIMER, &OnTimer);
  timer->Start(TIMER_INTERVAL, wxTIMER_CONTINUOUS);
  (...)
}

and that works as expected. Every 500ms the page source is retrieved and dumped to the console.

For lack of an easy way to get user data through the event I went with a global pointer to the wxWebView instance.

Automation

From here the rest is a matter of monitoring the incoming page source for the presence of tokens and injecting JavaScript accordingly.

Converting wxString to std string

In order to use my stringCompare, stringReplace, stringContains functions that all operate on std::string we need to convert from wxString to std::string:

wxString html_wx = m_webView->GetPageSource();
const char * c = static_cast<const char*>(html_wx.ToUTF8());
std::string s(c);

In order to get rid of the cookie banner we need to automatically click the “accept” button.

We can have a look at the “I Still Don’t Care About Cookies” [6] browser extension and inject similar JavaScript code.

The following code works perfectly to auto-click on “accept”.

if(stringContains(s, ">Alle akzeptieren</div>")) {
  std::cout << "[!] Cookie Accept Button Found" << std::endl;

  // ---
  const char * javascript = R""""(

function _sl(selector, container) {
  return (container || document).querySelector(selector);
}

const container = _sl('div[aria-modal="true"][role="dialog"]');

_sl("button + button", container).click();

  )"""";
  // ---
  
  wxString javascript_wx(javascript);
  m_webView->RunScript(javascript_wx);

}

Progress

Conclusion

With wxWdigets + wxWebView in C++ we can build a simple JavaScript capable web scrapper that runs fast and efficiently. JavaScript code can be injected and we could easily run advanced image recognition on screenshots of the wxWebView widget and trigger mouse and keyboard events. That is really all that is needed in most cases in order to efficiently scrape data from the web from C++, in a cross-platform way, with minimal amount of code.

1] https://zetcode.com/gui/wxwidgets/firstprograms/
2] https://stackoverflow.com/questions/22406511/how-compiling-wxwidgets-app-staticaly
3] https://stackoverflow.com/questions/208373/wxwidgets-how-to-initialize-wxapp-without-using-macros-and-without-entering-the
4] https://github.com/tranxuanthang/lrcget/issues/54
5] https://docs.wxwidgets.org/latest/classwx_string.html
6] https://github.com/OhMyGuus/I-Still-Dont-Care-About-Cookies/blob/master/src/data/js/8_googleHandler.js

Dennis Salzner - Web Scraper written in C++