
Free Amazon Keyword Rank Tracking Tool


In the dynamic world of digital commerce, businesses and vendors encounter a vast and ever-evolving landscape on platforms like Amazon. With an abundance of products spanning every category, growing organic search visibility within the marketplace is tough. Enter the realm of keyword tracker tools, a solution designed to bring order to this digital chaos. These tools allow businesses to streamline their product searches, providing organized and structured insights into the vast array of offerings available. By filtering products by category and tracking their rankings, businesses can gain invaluable insights into market trends and consumer preferences.

However, traditional keyword tracker tools often come with a price tag, placing them out of reach for many small businesses operating on tight budgets. 

The two main problems or limitations associated with traditional keyword tracker tools for Amazon are:

Cost: Traditional keyword tracker tools often come with a significant price tag, making them inaccessible for many small businesses operating on tight budgets. The high cost can pose a barrier to entry for smaller sellers who may not have the financial resources to invest in expensive tools.

Accessibility: Another issue is accessibility. Some traditional keyword tracker tools may have complex interfaces or require technical expertise to use effectively. This can be a challenge for smaller businesses with limited resources or less experienced users who may struggle to navigate the tool's features efficiently.

Recognizing this gap, we present our solution: a free Amazon Product Keyword Rank Tracking Tool that anyone can use and customize.

Tailored specifically for businesses navigating the Amazon marketplace, our tool offers a cost-effective alternative to paid subscriptions. By leveraging our tool, businesses can effortlessly monitor the rankings of the top 10 products within a chosen category, based on their keyword. With the ability to track changes over a three-day period, businesses can stay agile and responsive in an ever-evolving market environment.

Amazon Advantages: Why Choose the Marketplace Giant?

  • Vast Selection: Amazon boasts an extensive range of products, catering to diverse business requirements. Whether sourcing everyday essentials, niche items, or specialized equipment, Amazon offers unparalleled variety.

  • Competitive Pricing: With competitive prices and frequent discounts, Amazon ensures businesses can maximize their purchasing power. Compare prices effortlessly and secure the best deals for your business needs.

  • Convenience: Enjoy the convenience of 24/7 shopping from anywhere, with orders delivered directly to your business premises. Save time and streamline procurement processes with Amazon's hassle-free shopping experience.

  • Safety and Security: Rest assured knowing your business information and payment details are safeguarded with Amazon's robust security measures. Trust in Amazon's commitment to protecting your data throughout every transaction.

How does the Amazon Keyword Rank Tracking Tool Work?

Our tool, built with the Python programming language, uses a custom scraper to collect data each day. Product listings across three consecutive days are collected, and the tool displays today's top 10 products along with their rank difference over the previous two days - whether a product's rank has increased or decreased, or whether the product was even present in the previous day's rankings.

The tool then displays the collected data in a tabular structure that can be sorted as needed. The tabular data can also be downloaded. The tool additionally provides ways to add new keywords for analysis and to delete existing ones.

In short, our tool works as an Amazon Keyword Tracker that monitors the product listings and rankings for the keywords that have been entered, allowing the user to track progress and adjust their keyword strategy based on performance.

Now let us look into the more intricate working of our tool. 

Our tool mainly consists of two parts - the keyword scraper and the frontend user interface. Let's look into both of them and see how they are connected.

Tracking down information with the Scraper

Before looking into our scraper let's look at what scraping generally is.

Web scraping is the process of automatically extracting data from websites. It can be seen as a little robot that goes through a website, collects data and organizes it just like a spider weaving its web. 

Importing Libraries

import os
import asyncio
import pandas as pd 
from datetime import datetime, timedelta
from playwright.async_api import async_playwright

The scraper we have made uses the following libraries:

  • os module

  • asyncio

  • pandas

  • datetime

  • playwright

os module for file management

The os module in Python is a powerful tool for interacting with the operating system. It provides a wide range of functions for performing tasks like managing files and directories, launching processes, querying system information, and interacting with the command line.

The greatest benefit the os module gives us is its portability and simplicity. It provides a consistent interface for accessing operating system features across different platforms (Windows, Mac, Linux), and it offers a simple, concise API for common tasks, making your code easier to read and maintain.

The scraper uses the os module to ensure that files containing data more than two days old are deleted. It is also used to check whether the file containing the current day's data is present.
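For illustration, here is a minimal sketch of how helpers like 'check_if_file_exists()' and 'delete_old_files()' (which the multi-category scraper calls later) could be written with the os module. The exact implementations aren't shown in this post, so treat this as an assumption based on how they are used; it relies on the CSV_FILE_* constants introduced further below.

def check_if_file_exists(file_path):
    # True if the given data file is already present on disk
    return os.path.isfile(file_path)

def delete_old_files():
    # Keep only today's file, the two previous days' files and the keywords file;
    # anything else in the data folder is treated as stale and removed
    keep = {CSV_FILE_CURRENT_DAY, CSV_FILE_PREVIOUS_DAY,
            CSV_FILE_PREVIOUS_DAY_2, KEYWORDS_FILE}
    for name in os.listdir('./data'):
        path = './data/' + name
        if path not in keep:
            os.remove(path)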

Asyncio: Extracting data with async

Asyncio, short for Asynchronous I/O, is a powerful library in Python that allows one to write concurrent code in an efficient and straightforward way. The major benefit of Asyncio is therefore that it enables one to run multiple tasks concurrently, without having to wait for each one to finish before starting the next. 

This leads to the concept of asynchronous programming which will be explained later.

Pandas: For working with tabular data

Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is commonly used for data cleaning, manipulation, and analysis.

The DataFrame data structure in Pandas is a tabular data structure similar to a spreadsheet. It is used to store and organize data in rows and columns.

In the context of web scraping, the DataFrame data structure can be used to:

  1. Store the extracted data from a web page

  2. Clean and manipulate the data

  3. Analyze the data

  4. Export the data to a file

We use the .to_csv() method in Pandas to export a DataFrame to a CSV file. A CSV file is a text file that stores tabular data in rows and columns, separated by commas.

Datetime

The datetime module in Python is a powerful tool for working with dates and times. It provides functions and classes for representing, manipulating, and formatting dates and times in various ways.

In our tool, we store the data extracted each day into a CSV file named after the date on which the extraction was done. This ensures that scraping isn't repeated unnecessarily and that only three days' worth of data is retained.

Playwright

Playwright is a relatively new automation framework that supports multiple browsers, including Chrome Headless, Firefox Headless, and Edge Headless. Playwright is known for its speed and reliability, and it is often used for automated testing and web scraping. Automation frameworks are often integrated with headless browsers to provide a more powerful and efficient way to automate web testing. Headless browsers allow automation frameworks to run tests without opening a visible browser window, which can improve performance and reduce resource usage.

So what is a headless browser??

A headless browser is a web browser without a graphical user interface (GUI). This means that it runs without a visible window or tabs. Headless browsers are often used for automated tasks such as web scraping, automated testing, and other interactions with websites where a visible browser is not necessary.

Playwright offers both asynchronous and synchronous programming interfaces, and our tool uses the asynchronous one by importing 'async_playwright' from the Playwright library.

Asynchronous programming can seem like a complex concept, but at its core, it's a way of writing code that can handle multiple tasks at the same time, without slowing down or blocking the main thread. Asynchronous programming can be seen as something similar to juggling multiple balls in the air instead of throwing them one at a time.

Asynchronous programming offers the following benefits:

  • Improved performance: By handling multiple tasks concurrently, asynchronous code can be significantly faster than traditional code. 

  • Enhanced responsiveness: Program feels smoother and more responsive because it's not stuck waiting for single tasks to finish. We can interact with the program even while slower tasks are running in the background. 

  • Scalability: Asynchronous code scales well when dealing with a large number of concurrent requests or tasks, making it ideal for high-traffic applications
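As a minimal illustration (independent of our scraper), here is how asyncio can run several coroutines concurrently; the task names and delays are made up purely for demonstration:

import asyncio

async def fetch(name, delay):
    # Simulate an I/O-bound task, e.g. waiting on a network response
    await asyncio.sleep(delay)
    return f"{name} finished after {delay}s"

async def main():
    # The three tasks run concurrently, so the total time is roughly 2 seconds, not 4.5
    results = await asyncio.gather(
        fetch("task-a", 2),
        fetch("task-b", 1.5),
        fetch("task-c", 1),
    )
    for line in results:
        print(line)

asyncio.run(main())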

Main Working

Let’s look into the main working of the scraper. Keep in mind that we have created two scrapers for two situations and they have a bit of a difference in their implementation. The following description will be a generalized one based on the two scrapers. Each scraper will be introduced later. 

The scraper starts by creating a new context manager block that uses the playwright library.

Context managers help manage resources and ensure that certain setup and teardown operations are performed in a clean and predictable way. They are commonly used for tasks like file handling, database connections, or, in this case, browser automation. The ‘with’ statement in Python is typically used while initializing context managers.

Playwright's asynchronous environment is set up using 'async_playwright()' which helps with configuring and launching web browser instances for automation and using asynchronous programming methods. 

In the context manager, the result of calling async_playwright() is assigned to the variable pw. This allows us to reference the Playwright environment and its functionality within the context block using the variable pw.

One of the advantages of using a context manager is that it automatically performs cleanup actions when we exit the block. In the case of Playwright, it ensures that the browser instances are properly closed when we use the block. This helps prevent resource leaks and ensures that browser instances are shut down gracefully.

After setting up the context manager, a new Firefox browser instance for web scraping is launched. Note that Firefox for Playwright should be separately installed using the command “playwright install firefox”.

A context is then created for the browser. A context is like a separate browsing session. A new page is created within this context, which will be used to navigate and scrape data from web pages.

The script then opens the Amazon website and begins the scraping action. It finds the search tab and enters the relevant keywords to be searched for. Rather than using the whole search term that a customer might use, we use the relevant keywords as they allow us to have a much more general approach. Users can also use backend keywords instead of the relevant ones to specify the type of product that they want to analyze. These backend keywords can be obtained by simply using a different language while searching, by using various synonyms for relevant keywords, by specifying a particular property or even by using brand or model names.

The names of 10 products, excluding sponsored ones, are extracted in order of the keyword rankings. Keyword rankings are the position of products in the search results for the specific keywords used.

These keyword rankings are set up by considering various factors like Amazon SEO, conversion rates and potential customers of the product. 

  • Amazon SEO refers to optimizing product listings and content on Amazon to increase visibility in search results. These optimizations are necessary for reaching potential customers who shop on Amazon.

  • Potential customers are people who might be interested in buying the various products available on Amazon.

  • Conversion rates are the percentage of visitors to a product listing who actually make a purchase.

The extracted data is then stored in a pandas DataFrame and finally written to a CSV file.

Finally, the browser is closed to end the scraping session.

Let’s look into the Global Constants and various functions used by the scraper to complete execution.


Global Constants

BASE_URL = 'https://www.amazon.in/'
BASE_SEARCH_URL = 'https://www.amazon.in/s?k='
CSV_FILE_CURRENT_DAY = './data/' + str(datetime.today().date()) + '.csv'
CSV_FILE_PREVIOUS_DAY = './data/' + str(datetime.today().date() - timedelta(days=1)) + '.csv'
CSV_FILE_PREVIOUS_DAY_2 = './data/' + str(datetime.today().date() - timedelta(days=2)) + '.csv'
KEYWORDS_FILE = "./data/keywords.csv"
KEYWORDS = get_keywords()

The following constants are initialized at the beginning as they are used by various functions.

  • BASE_URL: The home URL of the site that we are scraping, which is Amazon in our case.

  • BASE_SEARCH_URL: The common initial part of the urls used when a particular word is searched. By adding a word to the end of this URL we can obtain the same result as using the search tab.

  • CSV_FILE_CURRENT_DAY: The name of the csv file used to store the latest data extracted. It is the date of the current day as a string.

  • CSV_FILE_PREVIOUS_DAY: The name of the csv file containing the previous day's data. It is the date of the previous day as a string.

  • CSV_FILE_PREVIOUS_DAY_2: The name of the csv file containing the data from two days back. It is the date of that day as a string.

  • KEYWORDS_FILE: The name of the csv file containing the list of keywords that are used to get the keyword rankings.

  • KEYWORDS: The list of keywords obtained by executing the custom ‘get_keywords()’ function.
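These file-path constants are also bundled together by a helper called 'get_csv_file_paths()', which is called later by both the scraper and the frontend but isn't shown in full in this post. A minimal sketch consistent with how it is used (today's file first, then the two previous days) could be:

def get_csv_file_paths():
    # Index 0 is today's file; indices 1 and 2 are the two previous days
    return [CSV_FILE_CURRENT_DAY, CSV_FILE_PREVIOUS_DAY, CSV_FILE_PREVIOUS_DAY_2]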

Obtaining the list of keywords present

def get_keywords():
  """
  The function retrieves the keywords to be searched for from the keywords.csv file.
  This is done so that keywords can be updated through the web app.
  The changes have to be made available in the scraper code as well.
  Thus, we use a function to retrieve the keywords from the keywords.csv file.
  The function also ensures that the keywords are sorted alphabetically.

  :return: Sorted List of keywords to be searched for.
  """
  df = pd.read_csv(KEYWORDS_FILE)
  keyword_list = df['Keywords'].tolist()
  sorted_keyword_list = sorted(keyword_list)
  return sorted_keyword_list

The relevant keywords are needed by both the frontend and the scraper, so we store them in a CSV file. This ensures that keywords updated on one side remain consistent with the other.

The csv is opened into a dataframe using pandas and then turned into a list and sorted. This sorted list is then returned.

Reaching the Product Listing Page (PLP) of a particular Keyword

async def find_page(page, keyword):
    """"
    Asynchronous function to find the page that is obtained on searching the given keyword.
    This function will first try to find the search box and fill it with the keyword,
    if the search box is not found, it will go to the search page directly,
    it goes directly to the search page by editing the BASE_SEARCH_URL and adding the keyword.

    :param page: Playwright page object
    :param keyword: Keyword to search for
                    Is a string
    :return: Playwright page object
    """
    try:
        search_box_container = await page.wait_for_selector('div[class="nav-search-field "]')
        search_box = await search_box_container.query_selector('input')
        await search_box.fill(keyword)
        await search_box.press('Enter')
    except:
        await page.goto(BASE_SEARCH_URL + keyword, wait_until='load', timeout=100000)
    await page.wait_for_timeout(10000)  # Allow time for dynamic content loading
    return page

In the Amazon website, we have to reach pages that contain the product listings belonging to each category depicted by the relevant keywords. This is done by searching for the keyword by entering it into the search tab. This will bring us to the page that lists the products of the entered keyword.

Sometimes, our scraper cannot find the search bar. This occurs due to the captcha set up by Amazon. In such cases search is done by adding the keyword at the end of the BASE_SEARCH_URL global constant.

Finding sponsored products

async def is_sponsored(product):
    """
    Asynchronous function to check if the product is sponsored.
    This function will first check if the product is sponsored,
    if the product is sponsored, it will return True,
    if the product is not sponsored, it will return False.

    :param product: Playwright object representing a single product listing
    :return: Boolean
    """
    if await product.query_selector(":has-text('Sponsored')"):
        return True
    return False

When searching for something, Amazon shows some sponsored products. The data of these products is neither extracted nor considered, as it can disrupt our analysis. The "is_sponsored()" function determines whether a given product is sponsored.

Product data extraction

async def scrape_products(page):
    """
    Asynchronous function to scrape the products from the search page.
    This function will first get the list of products from the search page,
    checks if the product is sponsored,
    if the product is sponsored, it will skip it,
    if the product is not sponsored, 
    it will add the product name to the list of product names.
    The function returns the list once 10 products have been scraped.

    :param page: Playwright page object
    :return: List of product names
    """
    product_list = []
    products = await page.query_selector_all('div[data-cy="title-recipe"]')
    product_count = 0
    for product in products:
        if product_count == 10:
            break
        if await is_sponsored(product):
            continue
        product_name = await get_product_name(product)
        if product_name in product_list:
            continue
        product_list.append(product_name)
        product_count += 1
    return product_list

From the PLP that we reach on searching with a relevant keyword, ten products are extracted in order of the keyword rankings. As we said earlier, sponsored products are ignored.

We obtain the list of all products present in the PLP by looking for a <div> element having an attribute ‘data-cy’ with value ‘title-recipe’.

Within this function, a separate function is used to obtain the product name as some different cases arise while getting the name.


get_product_name(product)

async def get_product_name(product):
    """
    Asynchronous function to get the product name.
    This function will first get the <h2> tags from the product,
    the name of the first <h2> tag is obtained,
    if there are more than one <h2> tag,
    the text of the second <h2> tag is obtained,
    it is checked if the product name already obtained is present in the new text,
    if it is present, it is removed from the new text,
    it is added to the product name to complete it.
    This is done because some products have more than one <h2> tag,
    and names might repeat themselves in multiple <h2> tags.

    :param product: Playwright object
    :return: name string
    """
    product_name = await product.query_selector_all('h2')
    name = "".join(await product_name[0].text_content())
    if len(product_name) > 1:
        remaining_name = await product_name[1].text_content()
        if name in remaining_name:
            remaining_name = remaining_name.replace(name, '')
        name += " " + remaining_name
    return name

Here we use the Playwright product element obtained earlier and search for <h2> tags within it.

  • The text content of the first <h2> tag obtained in this way is taken and stored. Now there are two cases that can arise.

  • In some cases products have only one <h2> tag and thus the earlier stored value is used as the product name.

  • In cases where there are more than one <h2> tag, we take the text content of the second <h2> tag to see if the earlier obtained name is repeated here. If repetition occurs, then the repeated part is removed and the rest is added to the name. If no repetition occurs then the whole text content is added to the name.

  • There are multiple checks that occur because:

  • the product name can appear entirely in a single <h2> tag.

  • the product name can appear in two <h2> tags with the first containing the brand name and the second containing the product name.

  • In some cases the brand occurring in the first <h2> tag is repeated in the second tag as well.

Now let’s look at each specific scraper, used for two purposes - for extracting data belonging to a single keyword and for extracting data for all available keywords.

Single category scraper

async def single_category_scraper(keyword):
    """
    Asynchronous function to scrape the products belonging to a single category.

    begins with creating a new playwright instance
    then a new browser is launched
    here we use firefox
    firefox for playwright is separately installed using the command - 'python -m playwright install firefox'
    a new context is created for the browser
    a new page is created for the context
    the page is navigated to the BASE_URL
    then the page is navigated to the PLP containing the products of the needed category
    10 products are scraped.

    The extracted data is then appended to all the csv files.
    The keywords are appended to the keywords file.

    :param keyword: Keyword to search for
                    Is a string
    """
    file_paths = get_csv_file_paths()
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(BASE_URL, wait_until='load', timeout=100000)
        target_page = await find_page(page, keyword)
        product_list = await scrape_products(target_page)
        df_current = pd.read_csv(file_paths[0])
        df_current[keyword] = product_list
        df_current.to_csv(file_paths[0], index=False)

        KEYWORDS.append(keyword)
        pd.DataFrame(KEYWORDS, columns=["Keywords"]).to_csv(KEYWORDS_FILE, index=False)

        for filename in (file_paths[1:]):
            try:
                df = pd.read_csv(filename)  # Check if file exists
                df[keyword] = product_list
                df.to_csv(filename, index=False)
            except:
                df_current.to_csv(filename, index=False)
        print("Successfully written products to all files")
        await browser.close()

This scraper is defined to be used when a new keyword is to be added by the user. When a user opts for a new keyword to be added, this scraper is run and it extracts data for that particular keyword only.

The earlier general description is in line with this scraper; only a few additions need to be made.

  • Here, before the context manager has been started, we obtain all the data files present.

  • The remaining execution works as described except that after collecting the data, changes are made to the data files.

  • The file containing the keywords is updated to add the new keyword.

  • The extracted data is added to the CSV file containing the latest data, and the same column is also written to the previous days' files so that the keyword is present in every data file.

Finally, the browser is closed to end the scraping session.

Multi-category scraper

async def multi_category_scraper():
    """
    Asynchronous function to scrape the products belonging to multiple categories.

    This function will be executed once every day.
    This is done by checking if the current day's data has already been collected.
    If the current day's data has already been collected, it will return and not run the script.
    If the current day's data has not been collected, it will run the script.

    Begins with creating a new playwright instance
    then a new browser is launched
    here we use firefox
    firefox for playwright is separately installed using the command - 'python -m playwright install firefox'
    a new context is created for the browser
    a new page is created for the context.

    The page is navigated to the BASE_URL
    then the page is navigated to the PLP containing the products of the needed category
    10 products are scraped.
    The above step is done for all keywords in the KEYWORDS list.

    The extracted data is then appended to all the csv files.
    The file containing the product data 3 days ago is deleted.

    """
    if check_if_file_exists(CSV_FILE_CURRENT_DAY):
        print("Today's data already collected. Run script tomorrow")
        return
    async with async_playwright() as p:
        browser = await p.firefox.launch()
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(BASE_URL, wait_until='load', timeout=100000)
        df_current = pd.DataFrame()
        for keyword in KEYWORDS:
            target_page = await find_page(page, keyword)
            product_list = await scrape_products(target_page)
            df_current[keyword] = product_list
        df_current.to_csv(CSV_FILE_CURRENT_DAY, index=False)
        print("Successfully written products to ", CSV_FILE_CURRENT_DAY)
        await browser.close()
    delete_old_files()
    dummy_file_maker()

This is the scraper that is executed each day to extract data for each and every keyword present. This scraper is scheduled to be run each day and doesn’t require the user's input. This scraper is run only once every day. This ensures that the analysis doesn’t become invalid during each new scraping process and that valuable insights can be obtained.

This scraper can be described by making a few changes to the initial description.

  • Here, before the context manager is started, we check whether the current day's data has already been collected; if so, the scraper exits without extracting any new data. The scraper runs only if the current day's data has not already been collected.

  • In this scraper after it reaches the Amazon website a loop is initiated. In each loop the data of a separate keyword will be scraped till all the keywords have been finished.

  • In each loop, the scraper searches for the search tab and enters the keyword to be searched for. The names of 10 products, excluding sponsored ones, are extracted in order.

  • After exiting the loop the data is written into a csv file and then the browser is closed.

  • After closing the browser any files containing data of more than two days before are deleted.

  • A dummy file maker function is also run. This function is used when the multi-category scraper is run for the first time and no data files exist yet. It creates CSV files with empty columns headed by keywords (a sketch of this helper follows below).

Finally the scraper session ends.
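The 'dummy_file_maker()' helper itself isn't shown in this post; a minimal sketch matching the description above (placeholder CSVs that contain only the keyword column headers) could look like this:

def dummy_file_maker():
    # On the very first run there are no previous-day files, so create
    # placeholder CSVs containing only the keyword column headers
    placeholder = pd.DataFrame(columns=KEYWORDS)
    for path in (CSV_FILE_PREVIOUS_DAY, CSV_FILE_PREVIOUS_DAY_2):
        if not os.path.isfile(path):
            placeholder.to_csv(path, index=False)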

Now let’s look into the second aspect of our tool - the frontend app.

Bringing insights to life with frontend

The frontend refers to the part of a website that users see and interact with directly. This includes things like the layout, colors, images, text, buttons, and animations. Frontend developers are responsible for creating and maintaining the user interface (UI) of a website using technologies like HTML, CSS, and JavaScript.

However, this also means that to create a website users can interact with, considerable knowledge of technologies like HTML, CSS, and JavaScript is needed.

This led to alternatives that don't require users to be deeply knowledgeable in these fields. One such alternative is Streamlit, which we have used to build the frontend side of our tool.

Importing Libraries

import os
import streamlit as st
import pandas as pd
import time

The frontend has been implemented with the following libraries.

  • Streamlit 

  • Os module

  • Pandas

  • Time module

Streamlit: Building data apps in minutes

Streamlit is an open-source Python library that makes it super easy to build and share beautiful, interactive web apps for data science and machine learning. It is an easy to use library that turns Python code into smooth applications within minutes.

Major benefits of using Streamlit are:

  • No need to learn frontend frameworks like React or Angular. Streamlit uses only Python, making it familiar and accessible for data scientists.

  • Streamlit apps are fast and interactive. Streamlit apps run with one line of code, updating live as we modify our script.

  • Streamlit provides built-in components for charts, graphs, tables, and other visuals, letting us create rich and informative interfaces.

  • We can publish our Streamlit app with a single click and share it with others through a link. No complex server setup required.

  • Streamlit covers everything one needs for data exploration and model deployment, including data loading, pre-processing, visualization, and user interaction.

OS module

As said earlier, the os module in Python is a powerful tool for interacting with the operating system. It can be used to run command line commands.

In our frontend we use the os module to install firefox for playwright by using the following line of code:

‘os.system(“playwright install firefox”)’.

Time module

The time module in Python is a versatile tool for handling various time-related tasks. It offers functions for accessing the current time, measuring execution time, scheduling tasks, and formatting dates.

In our tool we use the time module to pause the program execution for a specified number of seconds using the ‘sleep()’ function.

Main working

Now let's look at how our frontend user interface is actually implemented. Note that we will see not only how the user interface has been set up, but also how the frontend interacts with the scraper and with the scraped data.

Global Constants

KEYWORDS = get_keywords()
KEYWORDS_FILE = "./data/keywords.csv"
FILE_PATHS = get_csv_file_paths()

The given constants are initialized at the beginning as they are used by various functions. While KEYWORDS and KEYWORDS_FILE have the same use as in the scraper, FILE_PATHS is the list of paths to the data files used in our tool.

Initialization

@st.cache_resource
def initialization():
    run_multi_category_scraper()
    schedule_scraper(run_multi_category_scraper)

The Streamlit app begins by running an initialization function. The function has a cache_resource decorator. Decorators are wrappers that give extra functionality to the functions.

The ‘@st.cache_resource’ decorator in Streamlit is a powerful tool for optimizing our app's performance and efficiency. It allows you to cache the results of expensive computations, reducing the need to re-run them every time a specific part of your app updates or when the app is reloaded.

When the cached function is called for the first time, the result is calculated and stored in a cache. The needed data can then be accessed from the cache.

In our tool we have a cached function called initialization that is needed when the tool is deployed. 

When our tool is deployed it runs the initialization function. This function runs the multi-category scraper and also sets up a scheduler that runs the multi-category scraper every day.
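Neither 'run_multi_category_scraper()' nor 'schedule_scraper()' is shown in this post. A minimal sketch, assuming multi_category_scraper is imported from the scraper module and the scheduler is simply a daemon thread that re-runs the job once a day (the interval and threading approach are assumptions, not the tool's confirmed implementation):

import asyncio
import threading

def run_multi_category_scraper():
    # Wrap the async scraper so it can be called from ordinary synchronous code
    asyncio.run(multi_category_scraper())

def schedule_scraper(job, interval_seconds=24 * 60 * 60):
    # Re-run the given job in a background daemon thread once per interval
    def loop():
        while True:
            time.sleep(interval_seconds)
            job()
    threading.Thread(target=loop, daemon=True).start()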

Introductory content for our tool

st.title("Amazon Keyword - Product Rank Tool")
st.write(
    "This is a simple tool to rank products based on a keyword you search for. "
    "The rank is the sequential number of the product in the search results."
)
st.write(
    "The rank of three consecutive days (today, yesterday, day before yesterday) "
    "are calculated, and inferences are made based on this data."
)
st.write("Let's explore the data!")

Our tool has some general description in the beginning of the page before we move into the core functionalities.

This is just a simple description of what our tool is and what it does.

Now let us look into the three main functions of our tool - viewing scraped data in a structured format, adding new keywords for scraping and deleting existing keywords.

Viewing the scraped data

st.header("Enter the keyword you want to search for")
keyword_select = st.selectbox("Select the keyword", KEYWORDS)
keyword_submit = st.button("Submit Keyword")

# if button is clicked
if keyword_submit:
    st.markdown(
        "###### The table shows today's top 10 products of the **{}** category".format(
            keyword_select
        )
    )
    st.write("You can sort the table by clicking on the column headers")

    rank_change_1, rank_change_2 = get_total_rank_difference(keyword_select)
    needed_df = pd.DataFrame(
        {
            "RANK": [number for number in range(1, 11)],
            "NAME": df1[keyword_select],
            "RANK_CHANGE_1": rank_change_1,
            "RANK_CHANGE_2": rank_change_2,
        }
    )
    needed_df = needed_df.set_index("RANK")
    st.dataframe(
        needed_df.style.map(set_color, subset=["RANK_CHANGE_1", "RANK_CHANGE_2"]),
        use_container_width=True,
    )

    # Provide column explanations
    st.markdown(
        """
    **Explanation of Columns:**

    * **RANK:** The product's current rank.
    * **NAME:** The product's name.
    * **RANK_CHANGE_1:**
        - '<span style="color: blue;"> + </span>': Product wasn't present in yesterday's rankings.
        - '<span style="color: green;"> + value </span>': Product gained ranks today.
        - '<span style="color: red;"> - value </span>': Product dropped ranks today.
    * **RANK_CHANGE_2:** Similar to RANK_CHANGE_1, but compares to two days ago.
    """,
        unsafe_allow_html=True,
    )
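Note that 'df1', referenced in this snippet and again in the deletion section, is assumed to be the current day's DataFrame loaded near the top of the script, for example:

df1 = pd.read_csv(FILE_PATHS[0])  # today's scraped data, one column per keyword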

Our tool can be used to view the scraped data and the way it is ranked and related across three days. This data is displayed in the form of a table which can be sorted, downloaded and even searched through. The data is then analyzed to obtain valuable insights that help the user to understand which of the products make up the higher ranks in the keyword rankings. 

The table shows the rank and name of the top 10 products in a particular category and also their rank difference in relation to the previous two days. The table has a total of four columns.

  • RANK: The product's current rank.

  • NAME: The product’s name.

  • RANK_CHANGE_1: The change in the rank of the product considering the previous day’s data.

  • RANK_CHANGE_2: The change in the rank of the product considering the data two days before.

Both RANK CHANGE columns show data in specific colors and symbols. 

  • If a product from the current day is not found in the previous day, the rank is set to '+' which indicates a new product. This will also have a blue color.

  • Products that have dropped in rankings obtain a negative rank difference, and the ‘-’ sign will be shown in the table. This will have a color of red.

  • We add a '+' prefix to the rank difference if the difference is positive. This will have a color of green.

  • If no rank change is present then a simple 0 will be displayed without any additional colors.

Now let's look at the inner workings of the data displayed.

A streamlit selectbox is present in which any of the existing keywords can be selected to view the data. A selectbox is basically a widget that allows the user to select one option among the many present. 

After selecting a keyword, the user needs to press the submit button. On pressing the submit button, the following functions are performed and then the data is displayed.

  • Some general information on the table is displayed.

  • The data needed for the two RANK CHANGE columns are obtained by using a custom function ‘get_total_rank_difference()’.

  • Now, a dataframe is created having RANK, NAME, RANK_CHANGE_1 and RANK_CHANGE_2 as the four columns.

  • The streamlit dataframe function is then used to display the dataframe as a table. Colors are also set using the ‘set_color()’ custom function. The function is applied to each value of the RANK_CHANGE columns.

  • Info about what the various colors represent is also given below the table.


get_total_rank_difference(keyword)

def get_total_rank_difference(keyword):
    """
    Calculates rank differences for a given keyword across three days.
    This function contains another one called calculate_rank_difference() which calculates rank differences for a single comparison.
    Three dataframes are loaded into memory for each day.
    From the three dataframes we use only the column whose header is the keyword.
    
    :param keyword: The category of products whose rank differences are to be calculated.
                    Is a string.
    :return: Two lists of rank differences. The first list contains the rank differences between today and yesterday, and the second list contains the rank differences between yesterday and day before yesterday.
    """

    file_paths = get_csv_file_paths()
    df1, df2, df3 = pd.read_csv(file_paths[0]), pd.read_csv(file_paths[1]), pd.read_csv(file_paths[2])

    def calculate_rank_difference(current_data, previous_data):
        """
        Calculates rank differences for a single comparison.
        The rank of the current day is set by adding 1 to the index of the current data.
        Because the index of the dataframe starts from 0 and ranks from 1.
        If a product from the current day is not found in the previous day, 
        the rank is set to '+' which indicates a new product.
        While calculating the rank differences, 
        since products that have dropped in rankings obtain a negative rank difference,
        we add a '+' prefix to the rank difference if the difference is positive.

        :param current_data: The data for the current day.
                             Is a List.
        :param previous_data: The data for the previous day.
                              Is a Dictionary.
        :return: A list containing the rank differences of the two days.
        """

        rank_list = []
        for data in current_data:
            current_rank = current_data.index(data) + 1  # 1-based ranking
            previous_rank = previous_data.get(data, 0)  # Use 0 if not found in previous data
            diff = previous_rank - current_rank if previous_rank else '+'  # '+' indicates new product
            if diff != '+':  # Format numerical differences
                diff = str(diff) if diff <= 0 else '+' + str(diff)  # Prefix positive changes with '+'
            rank_list.append(diff)
        return rank_list

    current_data = list(df1[keyword])
    previous_data = dict(zip(df2[keyword], df2.index + 1))  # Map product names to ranks in previous data
    previous_data_2 = dict(zip(df3[keyword], df3.index + 1))

    previous_rank_list = calculate_rank_difference(current_data, previous_data)
    previous_rank_list_2 = calculate_rank_difference(current_data, previous_data_2)

    return previous_rank_list, previous_rank_list_2

  • The function reads the data stored in the three CSV files into three DataFrames.

  • Then the column containing product rankings for the specified keyword is extracted from each DataFrame.

  • The function then uses a helper function called ‘calculate_rank_difference’ to compare rankings between days: 

  • Pairs up current day's rankings with the previous day's rankings. 

  • Subtracts the current rank from the previous rank to find the difference (a positive value means the product has moved up).

  • Marks new products with a '+' since they didn't have a rank before. 

  • Puts a '+' in front of positive differences to highlight improvements in ranking.

set_color(value)

def set_color(value):
    """
    Assigns colors for rank difference values.
    
    :param value: The rank difference value.
                  Is a string.
                  Can be '+' for new products, 
                  has '-' prefix for rank drops, 
                  has '+' prefix for rank improvements, 
                  or '0' for no change.
    :return: A CSS color string.
             Colors are assigned to the rank difference values:
             - New product: blue
             - Rank dropped: red
             - Rank improved: green
             - No change: white
    """

    if value == '+':
        color = 'blue'  # New product
    elif value[0] == '-':
        color = 'red'  # Rank dropped
    elif value[0] == '+':
        color = 'green'  # Rank improved
    else:
        color = 'white'  # No change
    return 'color: %s' % color  # Return CSS color string

  • The function is used to visually highlight changes in product rankings by assigning different colors to rank difference values.

  • The function makes it easier to spot trends and changes at a glance.

  • New products are marked with a '+', colored blue to indicate their recent addition. 

  • Rank drops have a '-' prefix, colored red to highlight a decline in ranking. 

  • Rank improvements have a '+' prefix (excluding new products), colored green to signal a positive change.

  •  When there is no difference in rank, a value of '0' is present which is colored white to blend in and not draw attention.

  • The function constructs a CSS color property string (e.g., "color: red") that can be directly applied to text or elements in a web page or application.

Adding new keywords 

with st.expander("Can't find what you're looking for?"):
    st.subheader("Suggest a new keyword to expand our search capabilities.")

    # Get user input for new keyword
    suggested_keyword = st.text_input("Enter the keyword you'd like to suggest:")
    submit_suggestion_button = st.button("Submit Suggestion")

    if submit_suggestion_button and suggested_keyword:
        suggested_keyword_lower = suggested_keyword.lower()
        # Check if keyword is already present
        if suggested_keyword_lower in [keyword.lower() for keyword in KEYWORDS]:
            with st.empty():
                st.error(
                    "The keyword you entered is already in our list.\n"
                    "Please enter a different keyword."
                )
            time.sleep(3)  # Delay for error message visibility
        else:
            with st.empty():
                st.write("Thank you for your suggestion! We'll review it shortly.")
            time.sleep(1)  # Delay for message visibility
            info_container = st.empty()
            info_container.info(
                "Please note:\n"
                "- We'll notify you when the new keyword is added.\n"
                "- When a new keyword is added, only the rank of products available today will be shown.\n"
                "- For obtaining rank difference across three days, wait for two days to pass."
            )

            # Run the scraper for the suggested keyword
            new_keyword = capitalize_first_word(suggested_keyword_lower)
            run_single_category_scraper(new_keyword)

            info_container.empty()
            # Notify user of successful addition
            with st.empty():
                st.info("The new keyword has been added.")
            time.sleep(3)  # Delay for success message visibility
            st.rerun()

When a keyword that a user wants doesn't exist, they can add it. This feature helps users when a particular product category they want isn't in our keyword list.

This function is placed within a Streamlit expander, which is similar to a button that hides its content until it is clicked.

On clicking the expander it asks for the new keyword from the user. A text input space is present for the user to add the keyword. After entering the new keyword, the user has to click a submit button. On clicking the submit button, the following happens:

  • The tool checks whether the newly entered keyword already exists; if it does, an error message is displayed.

  • If the keyword doesn’t already exist then some basic info is shown and the single category scraper is run.

  • After the scraper finishes execution, a message is shown telling the user that the new keyword has been added. (Sketches of the helper functions used here follow below.)
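The helpers 'capitalize_first_word()' and 'run_single_category_scraper()' used in the snippet above aren't shown in this post; a minimal sketch of what they could look like, assuming single_category_scraper is imported from the scraper module and simply wrapped with asyncio.run():

import asyncio

def capitalize_first_word(keyword):
    # 'wireless earbuds' -> 'Wireless earbuds', matching the casing stored in the data files
    return keyword[:1].upper() + keyword[1:]

def run_single_category_scraper(keyword):
    # Run the async single-category scraper from inside the Streamlit script
    asyncio.run(single_category_scraper(keyword))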


Deleting a keyword

with st.expander("Delete an existing keyword?", expanded=False):
    st.subheader("Delete a keyword from our list.")
    keyword_for_deletion = st.selectbox(
        "Select the keyword to delete", KEYWORDS, key="delete_keyword"
    )
    if keyword_for_deletion not in df1.columns:
        st.rerun()
    warning_container = st.empty()
    warning_container.warning(
        """
        Make sure you want to delete the keyword !!!\n
        This action cannot be undone.
        """
    )
    # Handle deletion if confirmed
    if st.button("Submit Deletion"):
        delete_keyword(keyword_for_deletion)
        warning_container.empty()
        st.info("The keyword has been deleted.")
        time.sleep(1)  # Delay for success message visibility
        st.rerun()

If there is a keyword that the user no longer needs, or if a keyword was accidentally added, it can be deleted. This feature helps users keep the keyword list relevant to them and ensures that no unused keywords remain.

This function is also present within a streamlit expander.

On clicking the expander, the user is shown a warning about deleting the keyword, it then asks the user to select the keyword to delete from a selectbox. After selecting the keyword to be deleted, the user has to click a submit button. On clicking the submit button, a custom ‘delete_keyword()’ function is executed which deletes the keyword from all the data files. After this a message displaying successful keyword deletion is shown.

delete_keyword()

def delete_keyword(keyword):
    """
    Deletes a keyword from the keywords file and all the CSV files in the data folder.

    :param keyword: The keyword to be deleted.
                    Is a string.
    :return: None
    """
    keyword_list = get_keywords()
    keyword_list.remove(keyword)
    df = pd.DataFrame()
    df['Keywords'] = keyword_list
    df.to_csv(KEYWORDS_FILE, index=False)

    data_files = get_csv_file_paths()
    for data_file in data_files:
        df1 = pd.read_csv(data_file)
        df1 = df1.drop(keyword, axis=1)
        df1.to_csv(data_file, index=False)

  • The keyword is first removed from the CSV file containing the list of keywords.

  • The function then goes through each data file and deletes the column headed by the keyword.

Optimizing the tool for large-scale operation

Now that we have seen what our tool does and how it does it, let us look at how far it can scale and how to scale it further as usage grows.

Our tool works almost perfectly when we use a limited number of keywords. But this can change as the number of keywords grows significantly.

The main issue with an increase in the number of keywords is that it increases the time it takes to scrape the data for each keyword. This can cause problems and may even crash our tool.

An easy way to solve this issue is to divide the full keyword list into smaller batches and run multiple scrapers in parallel or with a pause between batches, as sketched below.
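A minimal sketch of this batching idea, assuming a coroutine 'scrape_batch' exists that scrapes one list of keywords (for example, a variant of multi_category_scraper() that accepts a keyword list); the batch size and pause length here are arbitrary:

import asyncio

def chunk_keywords(keywords, batch_size=10):
    # Split the full keyword list into smaller batches
    return [keywords[i:i + batch_size] for i in range(0, len(keywords), batch_size)]

async def scrape_all_in_batches(scrape_batch, pause_seconds=60):
    # Scrape one batch at a time, pausing between batches to keep each run small
    for batch in chunk_keywords(KEYWORDS):
        await scrape_batch(batch)
        await asyncio.sleep(pause_seconds)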

Another major concern arises when the number of users of our tool increases. This can cause a variety of problems.

  • The number of keywords can quickly increase.

  • Most of the keywords present would be irrelevant to any individual user.

  • Some users may delete keywords needed by some other users.

All of this can frustrate users and may even make them stop using the tool, as its functionality would cause more harm than good.

The best way to handle a growing number of users is to add a database that takes each user into account. The database can be structured so that one table contains all the keywords, another contains all the users, and a third table maps which keywords each user needs.

Users can then see and access only those keywords that are connected to them in the mapping table. When a user adds a new keyword, it is added to the mapping table, and it is added to the keywords table only if it doesn't already exist there.

When a user deletes a keyword, the user-keyword relation is removed from the mapping table, and the keyword is removed from the keywords table only if no other user needs it.

The mapping table ensures that keywords are added and removed efficiently, while the keywords table ensures that each keyword is scraped only once per day.
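A minimal sketch of such a schema using SQLite from Python (the table and column names are illustrative, not part of the current tool):

import sqlite3

conn = sqlite3.connect("tracker.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS keywords (
    id INTEGER PRIMARY KEY,
    keyword TEXT UNIQUE NOT NULL
);
-- The mapping table records which user tracks which keyword
CREATE TABLE IF NOT EXISTS user_keywords (
    user_id INTEGER REFERENCES users(id),
    keyword_id INTEGER REFERENCES keywords(id),
    PRIMARY KEY (user_id, keyword_id)
);
""")
conn.commit()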

Another feature that can increase the usefulness of our tool is to add the capabilities of a keyword tool. Keyword tools are software that helps users search for and identify the relevant keywords that have the most impact on the platform being targeted. Keyword tools save time and effort in choosing the right keywords.

Wrapping Up and What's Next!

While our Amazon tool offers a wealth of benefits for businesses, we understand that some may require additional solutions tailored to their unique needs. If you find yourself seeking a more personalized approach or require access to raw, unlimited data with the freshest updates, don't hesitate to reach out to us. Our team is dedicated to providing you with the precise insights and support necessary to optimize your business strategies effectively. Let us empower your business with the data-driven solutions you deserve.

So go ahead, dive into the world of Amazon with confidence, armed with your trusty Keyword Rank Tool. And remember, we, the Datahut team, are always here to help you take your shopping experience to the next level. Happy tracking, everyone!
