Solana NFT Scraper - Part 2: Node.js Implementation Notes

Enver Podgorcevic, Solana NFT Scraper

Introduction

In the previous document in this series we found a way to automatically extract NFT pubkeys from the blockchain via RPC node requests. The previous implementation was written in bash, which turned out to be too inflexible for error handling and data filtering.

I have implemented similar logic in Node.js, optimized and enhanced so that it scrapes some NFTs which the bash implementation missed. This document contains notes written while building the Node.js implementation, the source code of which you can find here.

Quick note on libraries used

After some quick research, I settled on the Axios library for sending HTTPS requests and JSONStream for stream-parsing JSON text. They give awesome performance and can use streams, so the memory usage is very low. If you have a recommendation for other libraries appropriate for these tasks that you think would make things better, feel free to message me 🙂

Also, a quick note on the problem I had with the bash implementation: as far as I know, there is no good streaming JSON parser for bash, at least for my use case, where the file to be parsed had no newline characters. All of the parsers and string-matching programs that I tried consumed all of the machine's memory and eventually crashed. So if you have a similar problem, don’t waste time on a bash script; write the program in some other programming language.

Time Optimization

The first improvement I made in the new script was time optimization. The script has to send 2 RPC requests for the public key of every account owned by the Token Metadata Program. (I will call these Token Metadata Program Accounts from now on.)

The bash script did the job but it was insufferably slow: it scraped at around 1700 NFTs/hour. At that pace it would take ~9 months to scrape them all.

After a quick search I found that the RPC documentation page says: “Requests can be sent in batches by sending an array of JSON-RPC request objects as the data for a single POST.”

I wrote a quick bash script to test it out and the results were beyond surprising. One unbatched RPC request takes about 1 second to return the response data. When I batched 200 requests and sent them it took... 1 second as well. 400? Same. More than 400? Well, I soon found out that the POST request payload size is limited.
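A minimal sketch of what such a batch looks like, assuming the standard JSON-RPC 2.0 request shape. The getTransaction method, the jsonParsed encoding option, and the placeholder signatures are illustrative assumptions, not taken from the actual script.

```javascript
// Build one JSON-RPC 2.0 batch: an array of request objects
// that is sent as the body of a single POST.
function buildBatch(method, paramsList) {
  return paramsList.map((params, i) => ({
    jsonrpc: "2.0",
    id: i, // ids let each response be matched back to its request
    method,
    params,
  }));
}

// JSON-RPC 2.0 does not guarantee that batch responses come back
// in request order, so index them by id before using them.
function indexResponsesById(responses) {
  const byId = new Map();
  for (const res of responses) byId.set(res.id, res);
  return byId;
}

// Hypothetical placeholder signatures, for illustration only.
const signatures = ["sigA", "sigB", "sigC"];
const batch = buildBatch(
  "getTransaction",
  signatures.map((sig) => [sig, { encoding: "jsonParsed" }])
);
const body = JSON.stringify(batch); // the text that would be POSTed to the RPC node
```

With Axios, the array could be passed directly as the data argument, along the lines of axios.post(rpcUrl, batch), where rpcUrl is a placeholder for the node's address.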

The server says that the maxBodyLength of a request is 10485760 bytes, but I found out that I get HTTP 413 Payload Too Large for payload sizes larger than 51000 bytes.

After some quick back-of-the-envelope calculations, I settled on 270 as a safe number of batched requests and implemented it that way. For each of the 3 different RPC requests used in the script, I divided the 51000-byte limit by the size of one request payload, assuming the longest possible pkey/signature, since public key and signature lengths can vary.
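The same back-of-the-envelope calculation can be sketched in code. The worst-case request here is a hypothetical one (an 88-character placeholder standing in for a maximum-length base58 signature, and an assumed getTransaction call); the real script's requests may differ in size.

```javascript
// Largest payload the node accepted, as measured in the text above.
const PAYLOAD_LIMIT_BYTES = 51000;

// Serialized size of one request in bytes, plus 1 for the comma that
// separates it from the next one (slightly conservative for the last item).
function requestSize(request) {
  return Buffer.byteLength(JSON.stringify(request), "utf8") + 1;
}

// How many worst-case requests fit under the limit.
// The 2 bytes subtracted account for the enclosing "[" and "]".
function maxBatchSize(worstCaseRequest, limitBytes = PAYLOAD_LIMIT_BYTES) {
  return Math.floor((limitBytes - 2) / requestSize(worstCaseRequest));
}

// Hypothetical worst-case request: longest signature, multi-digit id.
const worstCase = {
  jsonrpc: "2.0",
  id: 99999,
  method: "getTransaction",
  params: ["S".repeat(88), { encoding: "jsonParsed" }],
};
const size = maxBatchSize(worstCase);
```

Taking the smallest result across all request types used, and rounding it down a little further, leaves some margin for headers and longer ids.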

I rewrote the script, started it, and the results were amazing. The script now scrapes NFTs at around 73440 NFTs/hour, which is ~43 times faster than the initial script.

At this pace, the script would take something like 6 days and 6 hours to complete on my machine.
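As a quick consistency check, the figures quoted above can be put through the arithmetic. The total number of accounts is not stated in the text, so the value below is only what the quoted rate and duration imply.

```javascript
const oldRate = 1700;   // NFTs/hour, bash script
const newRate = 73440;  // NFTs/hour, batched Node.js script
const newDurationHours = 6 * 24 + 6; // "6 days and 6 hours"

// Speedup of the batched script over the bash one (about 43x).
const speedup = newRate / oldRate;

// Number of accounts implied by the new rate and duration (derived, not measured).
const impliedTotal = newRate * newDurationHours;

// Time the bash script would need for the same total, in days.
// This comes out at roughly 9 months, matching the earlier estimate.
const oldDurationDays = impliedTotal / oldRate / 24;
```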

Scraping Improvements

The initial bash script did a very good job of finding most of the NFTs that are in any way connected to some Token Metadata Program Account, but it wasn’t perfect. Around 2.7% of all the Token Metadata Program Accounts had transactions in a format that the script didn’t expect, and these were categorized as Irregular Token Metadata Program Accounts. We will call them Irregular Accounts for the rest of the document.

After some analysis of them, 4 transaction patterns emerged which together covered around 99.83% of all the NFTs. When checked by hand, the remaining 0.17% of Token Metadata Program Accounts weren’t connected to any kind of useful token through any of their transactions.

I checked roughly the first 20-30 of these and they were all either Unknown Tokens, mostly with both Total and Max Supply equal to 0, or some kind of tokens with bad metadata.

I also spot-checked random entries in the rest of the list and this seems to be the case in general: I found exactly 0 usable NFTs there.

In the next section I will present the 4 transaction patterns from which I got the majority of the NFTs scraped.

Transaction patterns

PostTokenBalances field

This method covers the large majority (~97.3%) of NFTs. If you go to any Solana marketplace and click on a random NFT, chances are that its transactions have this pattern.

The method is really very simple: in most cases, the first transaction of an NFT's mint authority is the one that mints that NFT. That is, the mint authority's first transaction will contain a mintTo instruction, which adds the mint’s public key to the postTokenBalances array in the RPC JSON response. The mint pubkey will be at the result → meta → postTokenBalances path.
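A minimal sketch of reading that path, assuming each entry of postTokenBalances carries a mint field, as in the RPC's transaction response. The sample response is made up and stripped down to the relevant fields.

```javascript
// Return the distinct mint public keys listed at
// result → meta → postTokenBalances, or an empty array if the path is missing.
function mintsFromPostTokenBalances(response) {
  const balances = response?.result?.meta?.postTokenBalances ?? [];
  return [...new Set(balances.map((b) => b.mint))];
}

// Hypothetical, heavily trimmed response with placeholder pubkeys.
const sampleMintTx = {
  result: {
    meta: {
      postTokenBalances: [
        { accountIndex: 1, mint: "MintPubkeyPlaceholder111" },
      ],
    },
  },
};
const foundMints = mintsFromPostTokenBalances(sampleMintTx);
```

An empty result is the signal to fall back to one of the other patterns.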

InitializeMint instruction

Look into the first transaction of the Program Account and go to result → transaction → message. Look for the initializeMint instruction. Its mint field will be the public key of the NFT.
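A sketch of that lookup, assuming the transaction was requested with jsonParsed encoding, so that token instructions appear with a parsed object holding type and info. The instructions array under message and the sample data are assumptions for illustration.

```javascript
// Find the initializeMint instruction under
// result → transaction → message and return its mint, or null.
function mintFromInitializeMint(response) {
  const instructions =
    response?.result?.transaction?.message?.instructions ?? [];
  const ix = instructions.find((i) => i.parsed?.type === "initializeMint");
  return ix?.parsed?.info?.mint ?? null;
}

// Hypothetical, trimmed response with placeholder pubkeys.
const sampleInitTx = {
  result: {
    transaction: {
      message: {
        instructions: [
          { parsed: { type: "createAccount", info: {} } },
          { parsed: { type: "initializeMint", info: { mint: "MintPubkeyPlaceholder222" } } },
        ],
      },
    },
  },
};
const initMint = mintFromInitializeMint(sampleInitTx);
```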

MintTokens instruction

Look into the first transaction of the Program Account and go to the array at the result → meta → innerInstructions path. Find a setAuthority instruction with an authorityType: mintAuthority field. If you find one, its mint field will contain the public key of an NFT.
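A sketch of this search, again assuming jsonParsed encoding and that innerInstructions is an array of groups, each with its own instructions array. The text names the authorityType value mintAuthority while the section heading suggests mintTokens; the sketch accepts either string, which is an assumption worth checking against real responses.

```javascript
// Scan result → meta → innerInstructions for a setAuthority instruction
// that changes the mint authority, and return its mint field (or null).
function mintFromSetAuthority(
  response,
  authorityTypes = ["mintAuthority", "mintTokens"] // assumed spellings
) {
  const groups = response?.result?.meta?.innerInstructions ?? [];
  for (const group of groups) {
    for (const ix of group.instructions ?? []) {
      const parsed = ix.parsed;
      if (
        parsed?.type === "setAuthority" &&
        authorityTypes.includes(parsed.info?.authorityType)
      ) {
        return parsed.info.mint ?? null;
      }
    }
  }
  return null;
}

// Hypothetical, trimmed response with placeholder pubkeys.
const sampleAuthTx = {
  result: {
    meta: {
      innerInstructions: [
        {
          index: 0,
          instructions: [
            { parsed: { type: "setAuthority", info: { authorityType: "mintAuthority", mint: "MintPubkeyPlaceholder333" } } },
          ],
        },
      ],
    },
  },
};
const authMint = mintFromSetAuthority(sampleAuthTx);
```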

Blind Method (identifying instructions with Token Metadata Program pkey)

If none of the above works, I found that some NFTs can be discovered by just looking for instructions that carry the Token Metadata Program public key (metaqbxxUerdq28cJ1RbAWkYQm3ybzjb6a8bt518x1s). In all the examples I encountered, the mint public key is the second argument to the instruction. This method in particular should be checked against the associated Rust code, since it’s based on a guess. I checked something like 30-40 NFTs scraped this way and they were always regular NFTs, so even though it’s a guess, it should be correct in most cases.
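A sketch of the blind method under two assumptions: that unparsed instructions in the response carry a programId and an accounts array, and that "second argument" means the second entry of that accounts array. The program's public key is passed in as a parameter; the sample uses placeholder strings.

```javascript
// Collect the second account of every instruction (top-level or inner)
// whose programId matches the given Token Metadata Program public key.
function mintsFromMetadataInstructions(response, metadataProgramId) {
  const topLevel =
    response?.result?.transaction?.message?.instructions ?? [];
  const inner = (response?.result?.meta?.innerInstructions ?? []).flatMap(
    (group) => group.instructions ?? []
  );
  const mints = new Set();
  for (const ix of [...topLevel, ...inner]) {
    if (
      ix.programId === metadataProgramId &&
      Array.isArray(ix.accounts) &&
      ix.accounts.length > 1
    ) {
      mints.add(ix.accounts[1]); // assumed position of the mint pubkey
    }
  }
  return [...mints];
}

// Hypothetical, trimmed response with placeholder ids.
const sampleBlindTx = {
  result: {
    transaction: {
      message: {
        instructions: [
          { programId: "MetadataProgramPlaceholder", accounts: ["MetadataAcct", "MintPubkeyPlaceholder444", "Payer"] },
          { programId: "SomeOtherProgram", accounts: ["A", "B"] },
        ],
      },
    },
    meta: { innerInstructions: [] },
  },
};
const blindMints = mintsFromMetadataInstructions(
  sampleBlindTx,
  "MetadataProgramPlaceholder"
);
```

Because this relies on account position alone, results are best validated afterwards, for example by checking that the returned key really is a mint account.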

© Enver Podgorcevic.