[2023-09-29 Fri 12:17]

If you’re experimenting with Playwright through Jupyter for HTML scraping, you’ll notice that Playwright offers two APIs: sync_api and async_api.

from playwright.sync_api import sync_playwright
import asyncio
from playwright.async_api import async_playwright

The trend is towards the async API, but several libraries have yet to catch up or may prefer to stick with sync. There’s no inherent problem with the sync API; the idea is to understand both and use whichever suits our needs.

Calling the sync API directly within the event loop isn’t advisable: it blocks the loop and hangs the entire system, leaving it unresponsive to events such as mouse movements, keyboard input, and incoming requests. When Playwright’s sync API refuses to run from an async loop, that’s correct behaviour.
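To see why blocking matters, compare a coroutine that calls time.sleep (which blocks the whole loop) with a background task that awaits asyncio.sleep (which yields control). This is a minimal stdlib sketch, no Playwright involved:

```python
import asyncio
import time

async def ticker(results):
    # A background task that should keep firing while other work happens
    for _ in range(3):
        await asyncio.sleep(0.05)
        results.append("tick")

async def main():
    results = []
    task = asyncio.create_task(ticker(results))
    time.sleep(0.2)         # blocks the loop: no ticks can fire during this
    blocked = len(results)  # still 0 - the ticker never got a chance to run
    await task              # the loop is free again and the ticks arrive
    return blocked, len(results)

blocked, final = asyncio.run(main())
print(blocked, final)  # 0 3
```

The 0.2-second sync sleep starved the ticker completely; a sync Playwright call inside a coroutine would starve the loop the same way for the duration of the scrape.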

from llama_index import download_loader
ReadabilityWebPageReader = download_loader("ReadabilityWebPageReader")
loader = ReadabilityWebPageReader()
documents = loader.load_data(url='https://github.com/n8n-io/n8n/pull/7136')
Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.

So we need a way to run this synchronous code when required. The strategy is to launch it in a separate thread, and asyncio conveniently lets us do that.

import asyncio
from llama_index import download_loader
ReadabilityWebPageReader = download_loader("ReadabilityWebPageReader")
loader = ReadabilityWebPageReader()
documents = await asyncio.to_thread(loader.load_langchain_documents, url='https://github.com/n8n-io/n8n/pull/7136')
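Under the hood, asyncio.to_thread hands the callable to the loop’s default thread pool, so the blocking call runs off the event-loop thread. A stdlib sketch, with sync_work standing in for the loader call:

```python
import asyncio
import threading

def sync_work():
    # Hypothetical stand-in for a blocking call like loader.load_data(url=...)
    return threading.current_thread() is threading.main_thread()

async def main():
    # to_thread submits sync_work to the loop's default ThreadPoolExecutor
    return await asyncio.to_thread(sync_work)

on_main_thread = asyncio.run(main())
print(on_main_thread)  # False: the blocking call ran off the event-loop thread
```

Because the event loop keeps running on the main thread while the worker thread waits, other coroutines stay responsive.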

We can also run multiple instances of the synchronous code concurrently, either on independent threads or through a thread pool. Since Playwright’s operations are I/O bound, time spent waiting for responses can be significant. Bear in mind, though, that Playwright itself can launch multiple browsers, which can be quite CPU intensive.

import asyncio
from llama_index import download_loader
ReadabilityWebPageReader = download_loader("ReadabilityWebPageReader")
loader = ReadabilityWebPageReader()

# Spawn one thread per request
documents = await asyncio.gather(*[asyncio.to_thread(loader.load_langchain_documents, url=url) for url in [
    'https://github.com/n8n-io/n8n/pull/7136',
    'https://github.com/n8n-io/n8n/pull/7136',
    'https://github.com/n8n-io/n8n/pull/7136',
    'https://github.com/n8n-io/n8n/pull/7136'
]])

# Spawn a pool of 2 threads and run 4 instances of the code in the pool
from concurrent.futures import ThreadPoolExecutor
pool = ThreadPoolExecutor(max_workers=2)
# Bind url as a default argument so each lambda captures its own value
documents = await asyncio.gather(*[asyncio.get_running_loop().run_in_executor(pool, lambda url=url: loader.load_langchain_documents(url=url)) for url in [
    'https://github.com/n8n-io/n8n/pull/7136',
    'https://github.com/n8n-io/n8n/pull/7136',
    'https://github.com/n8n-io/n8n/pull/7136',
    'https://github.com/n8n-io/n8n/pull/7136'
]])
pool.shutdown()
scraped: https://github.com/n8n-io/n8n/pull/7136
scraped: https://github.com/n8n-io/n8n/pull/7136
scraped: https://github.com/n8n-io/n8n/pull/7136
scraped: https://github.com/n8n-io/n8n/pull/7136
scraped: https://github.com/n8n-io/n8n/pull/7136
scraped: https://github.com/n8n-io/n8n/pull/7136
scraped: https://github.com/n8n-io/n8n/pull/7136
scraped: https://github.com/n8n-io/n8n/pull/7136

You might notice that the first four requests finish faster, being scraped simultaneously, while the latter four are processed in batches of two – slower, but using less CPU. This is the performance trade-off we need to account for.
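A middle ground between the two approaches is to keep using asyncio.to_thread but cap concurrency with an asyncio.Semaphore. A stdlib sketch, where fake_load is a hypothetical stand-in for loader.load_langchain_documents:

```python
import asyncio

def fake_load(url):
    # Hypothetical stand-in for loader.load_langchain_documents(url=url)
    return f"scraped: {url}"

async def scrape_bounded(urls, limit=2):
    sem = asyncio.Semaphore(limit)

    async def one(url):
        async with sem:  # at most `limit` threads are busy at a time
            return await asyncio.to_thread(fake_load, url)

    return await asyncio.gather(*(one(u) for u in urls))

results = asyncio.run(scrape_bounded(["https://example.com"] * 4))
print(results)
```

This gives the same batching behaviour as the two-worker pool without managing an executor by hand, and the limit can be tuned per host.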

Happy code experiments!