Solana NFT Scraper - Part 1: Introduction

Enver Podgorcevic, Solana NFT Scraper

Goal

The goal of this research is to find a way to collect information about as many useful NFTs on the market as possible. Useful NFTs are the ones that have some sort of value on the market and probably hold some kind of data.

Possible strategies for approaching this problem

Web scraping (The incomplete way)

Method outline

Build a web scraper for some of the most popular NFT marketplaces.

Pros

It’s a really quick and efficient way of collecting information about popular NFTs. It also has a short delay when discovering newly minted NFTs on the marketplaces we scrape.

Cons

We would have to build a new scraper for every page we want to scrape, and would have to update the scrapers whenever the layout of the page we scrape changes.

Another con is that this is by no means an exhaustive search of the NFT space. We would have to constantly update the marketplace list, and it would be hard to find NFTs which aren’t sold on popular marketplaces.

Also, there’s the issue of legality. Most sites nowadays publish a file named robots.txt in the site’s root folder. That file specifies which crawlers the site allows and which paths they may or may not fetch. Not respecting it could cause legal problems.

Here’s the example of Solanart’s robots.txt file.

Most Solana NFT marketplaces allow all kinds of scraping at the moment, but changing that in the future would be as easy as updating their robots.txt to disallow it.
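For illustration only (this is not any marketplace's actual file), a robots.txt that forbids all crawling needs just two lines: `User-agent: *` followed by `Disallow: /`. A scraper could check for that before fetching anything. The sketch below is a simplification under stated assumptions: the function name `disallowed_for_all` is hypothetical, and it only reads `Disallow` rules in groups that apply to `User-agent: *`, ignoring `Allow` rules, wildcards and bot-specific groups.

```shell
# Print the Disallow paths that apply to every crawler ("User-agent: *").
# Reads a robots.txt from stdin. Simplified: ignores Allow rules and wildcards.
disallowed_for_all() {
  awk '
    {
      sub(/\r$/, ""); sub(/#.*$/, "")      # drop carriage returns and comments
      gsub(/^[ \t]+|[ \t]+$/, "")          # trim surrounding whitespace
    }
    tolower($0) ~ /^user-agent:/ {
      agent = $0; sub(/^[^:]*:[ \t]*/, "", agent)
      if (!in_agents) applies = 0          # first User-agent line of a new group
      if (agent == "*") applies = 1
      in_agents = 1
      next
    }
    NF { in_agents = 0 }                   # any other rule ends the User-agent run
    applies && tolower($0) ~ /^disallow:/ {
      path = $0; sub(/^[^:]*:[ \t]*/, "", path)
      if (path != "") print path           # an empty Disallow allows everything
    }
  '
}

# A robots.txt that forbids all scraping:
printf 'User-agent: *\nDisallow: /\n' | disallowed_for_all    # prints "/"
```

An empty output means nothing is disallowed for generic crawlers, which matches what most marketplaces publish today according to the text above.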

Blockchain scraping (The caveman way)

Method outline

Since the Solana blockchain is a publicly available piece of data, one possible approach would be to first detect some kind of trace that creating an NFT leaves on the blockchain, and then go through the whole blockchain from ye olden days up until now and thus detect all the NFTs. A similar thing has already been done on other blockchains. You can read a paper on Ethereum data extraction here.

Pros

This approach, if done correctly, would yield all the Solana NFTs and all other kinds of useful Solana blockchain statistics. Awesome, let’s implement it.

Cons


Not so fast.

This approach would be an incredibly computationally intensive task. Going through the whole blockchain is not easy.

First of all, how big is the Solana mainnet blockchain anyway? According to Solana co-founder Anatoly Yakovenko’s answer from May 2021, it grows 2TB per year and it’s stored on Arweave. Cool.

But according to this guy’s blog about his Solana blockchain analysis tool, “As of June 2021 Solana generates close to 100 GB of raw data per day”, so that’s something to consider.

And since Solana is probably going to get more popular, the blockchain will grow faster over time.

Also, there’s the issue of parallelization. The task of reading through the whole blockchain isn’t easily parallelized.

This solution is probably not the best one out there, since to find what we need we would have to go through a lot of garbage data that we’re currently not interested in. Let’s see if there are better ways of doing it.

JSON RPC API (The sophisticated gentleman way)

Method outline

Solana has a list of RPC methods which can be invoked by sending an HTTP POST request with the appropriate JSON query parameters. It could be possible to get what we want by chaining several steps, each of which invokes a properly chosen RPC method and filters the results.
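Every request in this post has the same JSON-RPC 2.0 envelope, so it can be wrapped in a small helper. This is a minimal sketch, not an official client: `rpc_payload`, `rpc_call` and the `RPC_URL` variable are hypothetical names, and the params argument is assumed to already be a valid JSON array.

```shell
# Defaults to the public mainnet endpoint used throughout this post.
RPC_URL="${RPC_URL:-http://api.mainnet-beta.solana.com/}"

# rpc_payload METHOD PARAMS_JSON -> prints a JSON-RPC 2.0 request body.
rpc_payload() {
  printf '{"jsonrpc":"2.0","id":1,"method":"%s","params":%s}\n' "$1" "$2"
}

# rpc_call METHOD PARAMS_JSON -> POSTs the request and prints the raw response.
rpc_call() {
  rpc_payload "$1" "$2" |
    curl -s "$RPC_URL" -X POST -H "Content-Type: application/json" -d @-
}

# Show the body that would be sent (no network access needed):
rpc_payload getAccountInfo '["489yfNQvxwYTovjbZNKxz3ZbPGdCFGHkKGsmsvMDd2TR", {"encoding": "jsonParsed"}]'
```

With network access, `rpc_call getAccountInfo '[...]' | jq` would send the same body and pretty-print the answer.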

Pros

With this method we wouldn’t have to go through the whole blockchain ourselves; we could query the Solana nodes and get much narrower results. Also, if done correctly, this method should yield all the NFTs on the blockchain.

Another pro is that most of the steps here should be easy to parallelize. More on that later.
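As a rough idea of what that parallelism could look like in the shell, `xargs -P` runs several workers at once over a list of inputs. This is only a sketch: `process_all` is a hypothetical name, the worker merely echoes its input where a real one would issue RPC calls, and a public RPC endpoint may well rate-limit many concurrent requests.

```shell
# process_all FILE JOBS -> runs one worker per line of FILE, up to JOBS at a time.
# Output order is not guaranteed because the workers run concurrently.
process_all() {
  xargs -n 1 -P "$2" sh -c 'echo "processed $1"' worker < "$1"
}
```

For example, `process_all pubkeys.txt 4` would handle four lines of a (hypothetical) pubkeys.txt at a time.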

Cons

RPC requests and specifications could change in the future, so we’d have to update the automated scripts accordingly.

And the winner is...

The JSON RPC API method seems like the best candidate for this job. It should yield all the NFTs and it shouldn’t be as computationally intensive and Sisyphean as going through the whole blockchain.

So, fasten your seat belts, pack your bags, we’re going on an NFT scraping adventure.


JSON RPC API NFT Scraper

Requirements

This method makes heavy use of Linux terminal tools. I tried them out in zsh and bash but everything should work in your POSIX-compatible shell of choice.

You will need the following packages installed: curl, grep, awk, split, jq, head, diff.

Definition

Since our task is to find all the Metaplex NFTs on the Solana blockchain, the first thing we need is the definition of a Metaplex NFT, because how can we know we have what we want if we don’t know what we want in the first place?

The Metaplex Terminology page defines Non-fungible tokens as:

Metaplex's non-fungible-token standard is a part of the Solana Program Library (SPL), and can be characterized as a unique token with a fixed supply of 1 and 0 decimals. We extended the basic definition of an NFT on Solana to include additional metadata such as URI as defined in ERC-721 on Ethereum.

So, Metaplex NFTs are similar to Solana NFTs but they also have some sort of metadata attached to them.

Now we know what we’re after, jolly good!

Metaplex NFT structure and account hierarchy

Let’s take a look at a random Metaplex NFT and see what we can infer about its relationships with other Accounts and Programs.

Fire up your terminal and execute the following command:

curl http://api.mainnet-beta.solana.com/ -X POST -H "Content-Type: application/json" -d '
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getAccountInfo",
    "params": [
      "489yfNQvxwYTovjbZNKxz3ZbPGdCFGHkKGsmsvMDd2TR",
      {
        "encoding": "jsonParsed",
        "commitment": "confirmed"
      }
    ]
  }' | jq

It should return the following neat-looking (thanks to jq) JSON data:

{
  "jsonrpc": "2.0",
  "result": {
    "context": {
      "slot": 112588767
    },
    "value": {
      "data": {
        "parsed": {
          "info": {
            "decimals": 0,
            "freezeAuthority": "7LDqtwXUDz4Tzk9vfWn4sdDXz2Pv1F1QjbYstFLEgTEc",
            "isInitialized": true,
            "mintAuthority": "7LDqtwXUDz4Tzk9vfWn4sdDXz2Pv1F1QjbYstFLEgTEc",
            "supply": "1"
          },
          "type": "mint"
        },
        "program": "spl-token",
        "space": 82
      },
      "executable": false,
      "lamports": 1461600,
      "owner": "TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA",
      "rentEpoch": 260
    }
  },
  "id": 1
}

A lot of data here. Let’s compare it to a plain old fungible Solana token that I created with the spl-token tool on devnet. Execute the following command:

curl http://api.devnet.solana.com/ -X POST -H "Content-Type: application/json" -d '
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getAccountInfo",
    "params": [
      "Acfq2898yLunrvvKrMT92PrrjK4r2LCqVtCBtYdaL6Yk",
      {
        "encoding": "jsonParsed", "commitment": "confirmed"
      }
    ]
  }' | jq

We get the following result:

{
  "jsonrpc": "2.0",
  "result": {
    "context": {
      "slot": 102391108
    },
    "value": {
      "data": {
        "parsed": {
          "info": {
            "decimals": 0,
            "freezeAuthority": null,
            "isInitialized": true,
            "mintAuthority": "7bTgmqdVoukmfZmbkpL4gFpoTn5ch34CnYC7Wyd36aXG",
            "supply": "0"
          },
          "type": "mint"
        },
        "program": "spl-token",
        "space": 82
      },
      "executable": false,
      "lamports": 1461600,
      "owner": "TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA",
      "rentEpoch": 237
    }
  },
  "id": 1
}

Let’s put these results in two files, metaplexNFT.json and solanaFT.json and compare them with the following command:

diff metaplexNFT.json solanaFT.json

Which yields the following result:

5c5
<             "slot": 112588767
---
>             "slot": 102391108
13c13
<                         "freezeAuthority": "7LDqtwXUDz4Tzk9vfWn4sdDXz2Pv1F1QjbYstFLEgTEc",
---
>                         "freezeAuthority": null,
15,16c15,16
<                         "mintAuthority": "7LDqtwXUDz4Tzk9vfWn4sdDXz2Pv1F1QjbYstFLEgTEc",
<                         "supply": "1"
---
>                         "mintAuthority": "7bTgmqdVoukmfZmbkpL4gFpoTn5ch34CnYC7Wyd36aXG",
>                         "supply": "0"
29c29
<             "rentEpoch": 260
---
>             "rentEpoch": 237

As we can see, they differ in several places. First, they have different slot values. That is not surprising: slot is the slot at which the node evaluated each request, and the two requests were sent at different times to different clusters (mainnet-beta and devnet).

Next, we can see that the Metaplex NFT has a non-null freezeAuthority. A null freezeAuthority means that freezing and thawing accounts is permanently disabled for that token. But an ordinary Solana token can just as well be created with a freeze authority, and a Metaplex NFT could lack one, so this field will not help us much.

Next in line we have 3 fields: mintAuthority, supply and rentEpoch. supply and rentEpoch aren’t that interesting since they can be the same in both types of tokens, so let’s focus on the Mint Authority of our NFT and explore it next.
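Supply and decimals cannot identify a Metaplex NFT on their own, but the definition quoted earlier (a fixed supply of 1 and 0 decimals) still makes them a necessary condition. A first-pass check on a saved getAccountInfo response could look like the sketch below; `looks_like_nft` is a hypothetical name, and the jq paths assume the jsonParsed layout shown above.

```shell
# looks_like_nft FILE -> exit status 0 if the getAccountInfo response in FILE
# describes an SPL mint with 0 decimals and a supply of exactly "1".
looks_like_nft() {
  jq -e '
    .result.value.data.parsed
    | .type == "mint" and .info.decimals == 0 and .info.supply == "1"
  ' "$1" > /dev/null 2>&1
}

# Hypothetical usage with the two files from the diff above:
# looks_like_nft metaplexNFT.json && echo "candidate NFT"
# looks_like_nft solanaFT.json    || echo "not an NFT (supply is 0)"
```

A mint that passes this check is only a candidate; it says nothing yet about attached metadata.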

Mint authority

The mint authority address of our Metaplex NFT was 7LDqtwXUDz4Tzk9vfWn4sdDXz2Pv1F1QjbYstFLEgTEc. Let’s execute the following command and see what we get:

curl http://api.mainnet-beta.solana.com/ -X POST -H "Content-Type: application/json" -d '
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getAccountInfo",
    "params": [
      "7LDqtwXUDz4Tzk9vfWn4sdDXz2Pv1F1QjbYstFLEgTEc",
      {
        "encoding": "jsonParsed", "commitment": "confirmed"
      }
    ]
  }' | jq

The result is:

{
  "jsonrpc": "2.0",
  "result": {
    "context": {
      "slot": 112597071
    },
    "value": {
      "data": [
        "BgAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
        "base64"
      ],
      "executable": false,
      "lamports": 2853600,
      "owner": "metaqbxxUerdq28cj1RbAWkYQm3ybzjb6a8bt518x1s",
      "rentEpoch": 260
    }
  },
  "id": 1
}

We see that this is a data account (not executable). It contains some base64-encoded data in the data field. Of everything here, only the owner field looks like it could be interesting for us. Let’s investigate it a bit more and see what we get.
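The data field is opaque as base64, but its first byte is easy to inspect with standard tools. The sketch below decodes the start of the blob and prints the first byte as a number. My understanding, which should be treated as an assumption and checked against the Token Metadata program source, is that this byte is an account-type tag (6 for a Master Edition, 4 for a Metadata account). The function name `first_byte` is hypothetical.

```shell
# first_byte BASE64 -> prints the first decoded byte as an unsigned decimal number.
first_byte() {
  # 4 base64 characters always decode to 3 whole bytes, so a short prefix is enough.
  printf '%s' "$1" | cut -c 1-4 | base64 --decode | od -An -tu1 -N1 | tr -d ' \n'
  echo
}

first_byte "BgAAAAAAAAAAAQAA"   # start of the data shown above; prints 6
```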

The owner of the Metaplex NFT’s Mint Authority

If we head over to the Solana Explorer page and search for the owner address we previously obtained, we can see that this is the address of the Token Metadata Program from Metaplex.


Jolly good! Now we have a starting point for searching all the Metaplex tokens, since all of them should be connected to this program. (This should be checked more rigorously if we want to be completely sure that we’re not missing any Metaplex NFTs, but right now it seems that this is the case.)

Finding all the Metaplex Mint Authorities

Getting the JSON data

After poking around the Solana RPC documentation, I found out that this command can be used to get the list of all the accounts that are owned by Token Metadata Program:

curl http://api.mainnet-beta.solana.com/ -X POST -H "Content-Type: application/json" -d '
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getProgramAccounts",
    "params": [
      "metaqbxxUerdq28cj1RbAWkYQm3ybzjb6a8bt518x1s",
      {
        "encoding": "base64"
      }
    ]
  }' > mint_authorities.json

You can go rest and make yourself some coffee now, since this command will take some time to finish. In my experience it takes about 2 minutes for the Solana RPC node to find all the accounts this command asks for. (If it takes more than 5 minutes you might as well terminate the command and start over, because you will most likely get a timeout error.) While the node is finding the data, the Total column of curl’s progress meter will show some seemingly random number, and the Average Speed → Dload column will show 0.

When the RPC node computes the data, these columns will change and the download will start. The size of this JSON download at the time of writing this is around 7.8 GB.

This download method might fail sometimes due to several kinds of errors. Sometimes the URL is unresponsive and the download doesn’t start; sometimes, if the internet connection is bad, there’s a timeout error. If these are serious drawbacks, let’s talk about it and I will try to refine the command.
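One possible refinement is to wrap the download in a generic retry loop. The sketch below is an assumption about what such a refinement could look like, not a tested fix for this endpoint: `retry` is a hypothetical helper, and `request.json` in the usage comment is a hypothetical file holding the getProgramAccounts request body.

```shell
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_TRIES attempts have failed.
retry() {
  max_tries=$1; delay=$2; shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical usage: fail on HTTP errors, cap each attempt at 10 minutes,
# and overwrite (rather than append to) the output file on every attempt.
# retry 3 30 curl --fail --silent --show-error --max-time 600 \
#   -o mint_authorities.json -X POST -H "Content-Type: application/json" \
#   -d @request.json http://api.mainnet-beta.solana.com/
```

Note that an RPC node can also report an error inside a normal HTTP response, which `--fail` alone would not catch, so checking the saved file for an "error" field is still worthwhile.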

Filtering the JSON data

When the previous command finishes and we have our ~8 GB monster JSON inside the mint_authorities.json file, we will need to extract the useful data somehow. Don’t try to open the file with a text editor; it will probably freeze your computer.

You can peek at the data with the following command:

head -c 1000 mint_authorities.json

This command prints the first 1000 bytes of a file. The result should look something like this:

{"jsonrpc":"2.0","result":[{"account":{"data":["BgAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","base64"],"executable":false,"lamports":2853600,"owner":"metaqbxxUerdq28cj1RbAWkYQm3ybzjb6a8bt518x1s","rentEpoch":260},"pubkey":"59jbsfo9sC8hHCwFJcwL4w4DPzWqAqZqjw2dd6iZQuhW"},{"account":{"data":["BAVdPXEZbWu6xMbuBHXt/MfCwDDy5UXqx6aSmt3Qt5BCgqFPweVYdDjaR7278FMceLiVh6jUGK9lNSQDgeyHPYMgAAAARm94ICMyOTQ1AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKAAAAVEZGAAAAAAAAAMgAAABodHRwczovL2lwZnMuaW8vaXBmcy9RbVM5ZWJZUU1jejhtVEhrQ0VHV2Naako0QnV3WXhKOGk5ZjlqcE55Rlk5RVVKAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA..."

A lot of unreadable information here. To make it even harder to read, the JSON doesn’t contain any new lines. I added the ellipsis and the closed quotation mark at the end so that we at least have nice coloring.

Now, back to our main quest here. We’re interested in finding all the “pubkey” fields in this JSON, since they’re all the Mint Authority accounts owned by the Token Metadata Program by Metaplex.

This was not an easy step and I will probably write a short article on how I solved it and link it here. The main problem was that the file is so big and doesn’t contain any newlines. I couldn’t find a command-line solution to this problem, so I wrote a short Node.js pubkey extractor. It’s probably not the most efficient solution, since it takes about 1-2 minutes to finish on my machine, but it’s acceptable for now. If we need a quicker solution we can implement it in a more efficient language like Rust or C++.
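One command-line approach that might work is to let `tr` break the single giant line into small ones before grepping. This is a sketch and not the Node.js extractor mentioned above: `extract_pubkeys` is a hypothetical name, it assumes the compact `"pubkey":"..."` formatting visible in the head output, and I have not benchmarked it on the full file.

```shell
# extract_pubkeys FILE -> prints one pubkey per line.
extract_pubkeys() {
  # tr streams the input and turns every "}" into a newline, so grep never
  # has to hold the whole single-line file in memory. Neither the base64
  # data nor the base58 addresses can contain "}" or a double quote.
  tr '}' '\n' < "$1" |
    grep -o '"pubkey":"[^"]*"' |
    cut -d '"' -f 4
}
```

Hypothetical usage: `extract_pubkeys mint_authorities.json > pubkeys.txt`. One thing worth checking afterwards: the two accounts in the sample above start with different base64 prefixes (BgAA… and BAVd…), so the program may own more than one kind of account, and not every pubkey in the list is necessarily a Mint Authority.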

Mapping Mint Authority to NFT

The Solana Explorer way

Mint Authority, in my current understanding, is responsible for all the transactions related to minting and transferring its associated NFTs. If we copy the pubkey we got in the previous step and paste it in the Solana Explorer, we see the following data:


The interesting part here is the Transaction History. Only one transaction in this case, and here’s the info it contains:


I have intentionally hidden the Account Input(s) table so that the picture contains less unnecessary info. The key part of the image is the Token Balances table. It shows all the net token mint ownership changes at the end of the transaction.

We can see that the address 3UxXoQ3JLEwz19B4fwLxC6NHPpYFKdmNm1Qx7JTmE4tN received one token with address 3y7WeZC6ScFNAn9F7C2kBjrWu5Pk1Qq7sA4xskUVNNjX.

And sure enough, if we click on that address, we see the NFT image with all the relevant metadata — we have found the NFT Master Edition address!


Automating the Scraping

From the previous steps we see that it’s definitely possible to find the NFT address given the Mint Authority address. But it would be nice if we could automate the process. We could go and search the Solana RPC documentation page for the right method to invoke, but all the answers are already in the Solana Explorer page: all we need to do is track the RPC requests it sends when we perform the Mint-Authority-to-NFT lookup there.

Open up the Developer Tools (F12) and navigate to the Network tab. There’s a handful of requests sent and I encourage you to go and check them out. The two important ones for us are [getSignaturesForAddress](https://docs.solana.com/developing/clients/jsonrpc-api#getsignaturesforaddress) and [getTransaction](https://docs.solana.com/developing/clients/jsonrpc-api#gettransaction).

Let’s go to the command line and see what exact information we get from these.

curl http://api.mainnet-beta.solana.com/  -X POST -H "Content-Type: application/json" -d '
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getSignaturesForAddress",
    "params": [
      "59jbsfo9sC8hHCwFJcwL4w4DPzWqAqZqjw2dd6iZQuhW"
    ]
  }
' | jq

Results in the following JSON data:

{
  "jsonrpc": "2.0",
  "result": [
    {
      "blockTime": 1633291739,
      "confirmationStatus": "finalized",
      "err": null,
      "memo": null,
      "signature": "5Ldc6ypG77Qe4uAGrzTpEAAS9nVA1R1mVyqtF4uddMccvvEamAXGFw89rLK4EykHKsx5q92ftqFJ6jj1ok6VekaK",
      "slot": 99614699
    }
  ],
  "id": 1
}

To extract the signature from this JSON, we can tell jq to filter it out for us with the following command:

curl http://api.mainnet-beta.solana.com/  -X POST -H "Content-Type: application/json" -d '
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getSignaturesForAddress",
    "params": [
      "59jbsfo9sC8hHCwFJcwL4w4DPzWqAqZqjw2dd6iZQuhW"
    ]
  }
' | jq -r '.result[].signature'

The -r option outputs strings without surrounding quotes, and the ‘.result[].signature’ filter passes through just the signatures.

Note #1: Sometimes there’s more than one object in the “result” array. They are signatures of transactions for minting, making copies from the Master Edition and transferring token ownership. Currently no filtering is done on that part since it’s inconvenient to do in a bash script, but once we have it implemented in Node.js, the filtering will be customizable.

Note #2: The RPC doc page states that the “result” array will be limited to at most 1000 elements. But there are two more configuration fields, before and until, with which we can control the boundary transactions of our search. If neither is specified, the transactions are returned from the most recent one to the oldest. From what I’ve seen, most of the Mint Authorities sign several transactions: one for minting the Master Edition, several for minting Print NFTs and several for their transfers. It’s mostly below 10, but who knows what kind of data we’re going to stumble upon while searching the whole blockchain, so I think this could be valuable information if we want to make a complete search.
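If an address ever does exceed that limit, the before field can be used to walk backwards page by page. The following is a sketch under assumptions, not something I have run against an address with thousands of signatures: `fetch_page`, `all_signatures`, `RPC_URL` and `PAGE_LIMIT` are hypothetical names, and the loop assumes a page shorter than the limit means there is nothing older left.

```shell
RPC_URL="${RPC_URL:-http://api.mainnet-beta.solana.com/}"
PAGE_LIMIT=1000   # maximum page size according to the RPC documentation

# fetch_page ADDRESS [BEFORE_SIGNATURE] -> raw JSON response for one page.
fetch_page() {
  if [ -n "$2" ]; then
    options=$(printf '{"limit": %s, "before": "%s"}' "$PAGE_LIMIT" "$2")
  else
    options=$(printf '{"limit": %s}' "$PAGE_LIMIT")
  fi
  printf '{"jsonrpc": "2.0", "id": 1, "method": "getSignaturesForAddress", "params": ["%s", %s]}' \
    "$1" "$options" |
    curl -s "$RPC_URL" -X POST -H "Content-Type: application/json" -d @-
}

# all_signatures ADDRESS -> every signature, newest first, one per line.
all_signatures() {
  before=""
  while :; do
    page=$(fetch_page "$1" "$before" | jq -r '.result[].signature')
    [ -n "$page" ] || break                       # empty page: nothing older left
    printf '%s\n' "$page"
    # A page shorter than the limit is the last one.
    [ "$(printf '%s\n' "$page" | grep -c .)" -ge "$PAGE_LIMIT" ] || break
    before=$(printf '%s\n' "$page" | tail -n 1)   # continue below the oldest seen
  done
}
```

Hypothetical usage: `all_signatures 59jbsfo9sC8hHCwFJcwL4w4DPzWqAqZqjw2dd6iZQuhW`.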

Now we need to get the relevant information from the transaction data. To see all the transaction data, execute the following command:

curl http://api.mainnet-beta.solana.com/ -X POST -H "Content-Type: application/json" -d '
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getTransaction",
    "params": [
      "5Ldc6ypG77Qe4uAGrzTpEAAS9nVA1R1mVyqtF4uddMccvvEamAXGFw89rLK4EykHKsx5q92ftqFJ6jj1ok6VekaK",
      "json"
    ]
  }
' | jq

It results in a lot of data, which might be concerning later when we need to fetch all this data quickly, but right now we just want to find a way to get the NFT pubkey. This is the result of the previous command:

{
  "jsonrpc": "2.0",
  "result": {
    "blockTime": 1633291739,
    "meta": {
      "err": null,
      "fee": 10000,
      "innerInstructions": [
        // A lot of data here
      ],
      "logMessages": [
        // A lot of data here
      ],
      "postBalances": [
        // Some other data here
      ],
      "postTokenBalances": [
        {
          "accountIndex": 2,
          "mint": "3y7WeZC6ScFNAn9F7C2kBjrWu5Pk1Qq7sA4xskUVNNjX",
          "uiTokenAmount": { //          ^ NFT Pubkey
            "amount": "1",
            "decimals": 0,
            "uiAmount": 1,
            "uiAmountString": "1"
          }
        }
      ]
    }
    // A lot of data here
  },
  "id": 1
}

We can add another jq filter to get just the NFT pubkey, like this:

curl http://api.mainnet-beta.solana.com/ -X POST -H "Content-Type: application/json" -d '
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getTransaction",
    "params": [
      "5Ldc6ypG77Qe4uAGrzTpEAAS9nVA1R1mVyqtF4uddMccvvEamAXGFw89rLK4EykHKsx5q92ftqFJ6jj1ok6VekaK",
      "json"
    ]
  }
' | jq -r '.result.meta.postTokenBalances[].mint'

Which results in 3y7WeZC6ScFNAn9F7C2kBjrWu5Pk1Qq7sA4xskUVNNjX.

🎉 v o i l à 🎉

We have successfully extracted the NFT pubkey from its Mint Authority. There’s also a script that does all this automatically.
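To show how the two calls chain together, here is a minimal sketch of the whole lookup. It is not the script mentioned above, just an illustration: `rpc`, `nft_mints_for` and `RPC_URL` are hypothetical names, there is no error handling or filtering of the signatures, and every mint that appears in a transaction’s postTokenBalances is reported.

```shell
RPC_URL="${RPC_URL:-http://api.mainnet-beta.solana.com/}"

# rpc METHOD PARAMS_JSON -> raw JSON-RPC response.
rpc() {
  printf '{"jsonrpc": "2.0", "id": 1, "method": "%s", "params": %s}' "$1" "$2" |
    curl -s "$RPC_URL" -X POST -H "Content-Type: application/json" -d @-
}

# nft_mints_for AUTHORITY -> unique mint addresses found in the token balances
# of every transaction that involves AUTHORITY.
nft_mints_for() {
  rpc getSignaturesForAddress "[\"$1\"]" |
    jq -r '.result[].signature' |
    while read -r signature; do
      rpc getTransaction "[\"$signature\", \"json\"]" |
        jq -r '.result.meta.postTokenBalances[].mint'
    done |
    sort -u
}
```

Hypothetical usage: `nft_mints_for 59jbsfo9sC8hHCwFJcwL4w4DPzWqAqZqjw2dd6iZQuhW`, which for the example above should print the same mint address as the manual steps.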

Epilogue

We have found a way to (probably) find all the Metaplex NFTs on the marketplace at any given time. After we actually obtain all the NFT pubkeys with this method, we will be able to empirically check if any NFTs from the marketplaces are missing.

The next steps will be to rewrite this whole process in a proper programming language, probably Node.js or Rust, and to optimize every step along the way, so that we can actually get all the NFTs in a practical amount of time.

Useful resources

Data Dragon

JSON RPC API

Solana Token List

ExtractingDataFromEthereum.pdf

© Enver Podgorcevic.