This blog post gives a first look at how to use open-source NLP models and large language models (LLMs) with Python.
Transformer models have revolutionized natural language processing (NLP) by enabling models to handle sequential data more efficiently than traditional architectures. Introduced in the seminal paper Attention Is All You Need by Vaswani et al. in 2017, transformers rely on a mechanism called self-attention to process input data.
Transformers are neural network architectures designed to handle sequences of data by weighing the relevance of different elements in the input. This is achieved through the self-attention mechanism, which allows the model to consider the entire input sequence simultaneously, capturing long-range dependencies more effectively.
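To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. The projection matrices and the toy input sequence are random values chosen purely for illustration, not part of any real model:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X has shape (seq_len, d_model): one embedding per token
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of every token with every other token
    weights = softmax(scores, axis=-1)       # one attention distribution per token
    return weights @ V                       # each output mixes information from the whole sequence

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)

Because every token's output depends on all other tokens through a single matrix operation, the whole sequence can be processed at once rather than step by step.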
Key components:
Self-Attention: lets every token attend to every other token in the sequence and weigh its relevance.
Positional Encoding: injects word-order information, since attention by itself is order-agnostic.
Feed-Forward Layers: position-wise transformations applied after attention, combined with residual connections and layer normalization.
Transformer models can be categorized based on their architecture:
Encoder-only models (e.g., BERT, RoBERTa): read the full input at once; well suited to understanding tasks such as classification.
Decoder-only models (e.g., GPT-style models): generate text token by token; well suited to text generation.
Encoder-decoder models (e.g., T5, FLAN-T5): map an input sequence to an output sequence; well suited to translation and summarization.
Transformer models offer several advantages: all tokens are processed in parallel rather than sequentially, long-range dependencies are captured directly through attention, and large pre-trained models transfer well to new tasks with relatively little fine-tuning.
By understanding the roles of encoders and decoders within transformer models, you can select the appropriate architecture for your NLP tasks, leveraging the power of modern language models to achieve state-of-the-art results.
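As a rough illustration of how this choice shows up in code (a sketch using the Hugging Face transformers library introduced in the next section; the model names are just common examples), each architecture family has its own model class:

from transformers import (
    AutoModelForSequenceClassification,  # encoder-only models, e.g. BERT/RoBERTa classifiers
    AutoModelForCausalLM,                # decoder-only models, e.g. GPT-style text generators
    AutoModelForSeq2SeqLM,               # encoder-decoder models, e.g. T5/FLAN-T5
)

encoder_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
decoder_model = AutoModelForCausalLM.from_pretrained("gpt2")
seq2seq_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

We will use the first and third of these classes in the examples below.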
Hugging Face is a leading open-source platform that provides tools and resources for natural language processing (NLP) and machine learning. It hosts a vast collection of pre-trained models, datasets, and libraries that make it easier for developers and researchers to leverage state-of-the-art models for various NLP tasks.
At its core, Hugging Face is a community-driven hub that facilitates the sharing and use of machine learning models. The platform offers:
Transformers Library: A popular library that provides APIs and tools to easily download and use pre-trained models for tasks like text classification, translation, summarization, question answering, and more.
Model Hub: A repository where developers can share and access thousands of pre-trained models contributed by the community.
Datasets Library: A collection of ready-to-use datasets for NLP tasks.
Tokenizers Library: Efficient implementations of tokenizers, which are essential for preparing text data for models.
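As a quick taste of the Transformers library, the high-level pipeline API can run a pre-trained model in a couple of lines. A minimal sketch (on first use it downloads a default sentiment model, currently the DistilBERT SST-2 model discussed below; the exact default may change between library versions):

from transformers import pipeline

# Downloads and caches a default pre-trained sentiment model on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Open-source NLP models are surprisingly easy to use!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]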
Several open-source large language models (LLMs) are available on Hugging Face. Some of the most popular families include Meta's Llama models, Mistral, Falcon, BLOOM, and Google's Flan-T5 (an instruction-tuned encoder-decoder family that we will use later in this post).
In this section, we’ll walk through setting up a sentiment analysis pipeline using Hugging Face’s Transformers library in Python. We’ll cover installing the necessary packages, explain each step of the code, and demonstrate how to run a simple sentiment analysis.
First, ensure you have Python installed on your system. Hugging Face's Transformers library requires a reasonably recent Python 3 (current releases require Python 3.8 or higher). You can check your Python version by running:
python --version
We need to install transformers and torch (PyTorch) or tensorflow:
pip install transformers
pip install torch
This can of course also be installed with any virtual environment tool, such as poetry.
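For example, with poetry the packages would be added to an existing project roughly like this:

poetry add transformers torch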
The default sentiment analysis model in the transformers library is distilbert-base-uncased-finetuned-sst-2-english. This model is trained on the Stanford Sentiment Treebank v2 (SST-2) dataset, which contains only two sentiment labels, Positive and Negative, so it can only classify input as one or the other; there is no Neutral class. Therefore, I will instead use the cardiffnlp/twitter-roberta-base-sentiment-latest model, a RoBERTa-base model trained on ~124M tweets (posted between 2018 and 2021) and fine-tuned for sentiment analysis on the TweetEval benchmark, which adds a Neutral label.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np
from scipy.special import softmax
# Preprocess text (username and link placeholders; this is how the model was trained)
def preprocess(text):
    new_text = []
    for t in text.split():
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
# Load the tokenizer and model
model_name = 'cardiffnlp/twitter-roberta-base-sentiment-latest'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Labels mapping
labels = ['Negative', 'Neutral', 'Positive']
def analyze_sentiment(text):
    # Preprocess the text
    text = preprocess(text)
    # Encode the text
    encoded_input = tokenizer(text, return_tensors='pt')
    # Get model outputs
    with torch.no_grad():
        output = model(**encoded_input)
    # Convert logits to probabilities
    scores = output.logits[0].numpy()
    scores = softmax(scores)
    # Get the label with the highest score
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    top_label = labels[ranking[0]]
    top_score = scores[ranking[0]]
    # Return the result
    return {
        'label': top_label,
        'score': float(top_score),
        'scores': {labels[i]: float(scores[i]) for i in range(len(scores))}
    }
# Test the function
proposal_text = "I am thrilled with the new updates to our DAO governance model!"
sentiment = analyze_sentiment(proposal_text)
print(f"Text: {proposal_text}")
print(f"Sentiment: {sentiment['label']}, Score: {sentiment['score']:.4f}")
print("All scores:", sentiment['scores'])
Below are some examples where the model successfully classifies a positive, a negative, and a neutral text.
Text: I am thrilled with the new updates to our DAO governance model!
Sentiment: Positive, Score: 0.9901
All scores: {'Negative': 0.0016886675730347633, 'Neutral': 0.008168383501470089, 'Positive': 0.9901429414749146}
Text: I am not thrilled with the new updates to our DAO governance model!
Sentiment: Negative, Score: 0.9132
All scores: {'Negative': 0.9132303595542908, 'Neutral': 0.07846872508525848, 'Positive': 0.008300881832838058}
Text: My name is Magnus
Sentiment: Neutral, Score: 0.8236
All scores: {'Negative': 0.03530899062752724, 'Neutral': 0.8236108422279358, 'Positive': 0.14108018577098846}
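If you need to score many texts at once (for example a batch of governance proposals), the tokenizer and model also accept lists. A minimal sketch that reuses the tokenizer, model, labels, and preprocess defined above; padding=True is needed so all sequences in the batch get the same length:

texts = [
    "Great proposal, I fully support it!",
    "This is a terrible idea.",
    "The vote takes place on Tuesday.",
]
encoded = tokenizer([preprocess(t) for t in texts], return_tensors='pt',
                    padding=True, truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**encoded).logits, dim=-1)
for text, p in zip(texts, probs):
    print(f"{text} -> {labels[int(p.argmax())]} ({float(p.max()):.4f})")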
When using these models in papers, it is important to cite the underlying work correctly, to give credit where it is due. Citation information can be found on the project page.
@inproceedings{loureiro-etal-2022-timelms,
title = "{T}ime{LM}s: Diachronic Language Models from {T}witter",
author = "Loureiro, Daniel and
Barbieri, Francesco and
Neves, Leonardo and
Espinosa Anke, Luis and
Camacho-collados, Jose",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-demo.25",
doi = "10.18653/v1/2022.acl-demo.25",
pages = "251--260"
}
In this section, we’ll explore how to use Large Language Models (LLMs) to evaluate a CV. We’ll walk through extracting text from a PDF CV, feeding it into an LLM, and interpreting the results.
There are a few open-source Python libraries for extracting text from PDFs, such as PyMuPDF, pdfplumber, and pypdf, as well as paid options such as Azure Document Intelligence. In this blog post, we will use pymupdf4llm. This package converts the pages of a PDF to Markdown text using PyMuPDF: standard text and tables are detected, put into the correct reading order, and then converted to GitHub-compatible Markdown.
The choice of language model is of course very important. For this example, we will use google/flan-t5-base and run it on an NVIDIA GeForce GTX 1060 6GB (an old and not particularly powerful GPU).
pip install pymupdf4llm
pip install torch
import pymupdf4llm
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the CV PDF and convert it to a Markdown string
cv_file = "/home/magnus/Git/HanssonMagnus/research/latex/cv/professional/cv_magnus.pdf"
md_text = pymupdf4llm.to_markdown(cv_file)
# Evaluate the CV with an LLM
def evaluate_cv_with_hf(cv_text):
    """
    Uses an open-source LLM to evaluate a CV.

    Args:
        cv_text (str): The text content of the CV.

    Returns:
        str: The evaluation response from the model.
    """
    model_name = 'google/flan-t5-base'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Use GPU if available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    prompt = (
        "You are a professional career advisor. Please evaluate the following CV:\n\n"
        f"{cv_text}\n\n"
    )

    inputs = tokenizer(prompt, return_tensors='pt', truncation=True,
                       max_length=512).to(device)

    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=5,
        early_stopping=True
    )

    evaluation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return evaluation
evaluation = evaluate_cv_with_hf(md_text)
print("CV Evaluation:\n")
print(evaluation)
The following is the result:
CV Evaluation:
**WORK EXPERIENCE** **Stockholm Business School and Swedish House of Finance** August
2023 - Ongoing _Assistant Professor (Quantitative Finance)_ _Stockholm, Sweden_
Quantitative finance, market microstructure, and decentralized finance. __ **Independent
Consultant** September 2023 - Ongoing _Data Scientist_ _Remote_ Risk modelling and data
science. __ **Combine Control Systems** September 2017 - August 2018 _Data Scientist_
_Gothenburg, Sweden_ Artificial neural network for virtual engine testing. __ **Nordea
Bank** 2012, 2013, 2014 _Summer Analyst_ _Gothenburg, Sweden_ Summer analyst in
corporate retail. __ **EDUCATION** **University of Gothenburg** September 2018 - June
2023 _PhD Economics (Quantitative Finance)_ _Gothenburg, Sweden_ Thesis: “Decentralized
Finance and Central Bank Communication”. __ Visiting PhD student at Stockholm School of
Economics. __ **Lund University** September 2016 - June 2017 _MSc Financial Economics_
_Lund, Sweden_ Thesis: “On Stock Return Prediction with LSTM Networks”. __ **Lund
University** September 2011 - June 2014 _BSc Mathematics_ _Lund, Sweden_ Exchange
semester at University of St.Gallen.
It is noteworthy how “bad” this result is compared to more advanced models. The exact same prompt gives the following result from ChatGPT-4o:
# CV Evaluation
## Strengths:
1. **Concise and Clear**: The CV highlights your background in quantitative finance,
data science, and decentralized finance concisely. The focus on methods and
programming is strong.
2. **Relevant Experiences**: Your roles in academia, industry, and consulting show
versatility and expertise in both research and practical applications.
3. **Clear Technical Competence**: Listing programming languages and tools clearly
conveys your proficiency. It's good that you highlight Linux, Git, Vim, and Ethereum
nodes, which shows depth in technical knowledge.
4. **PhD and Research**: Your education section clearly outlines advanced academic
achievements, which is a major strength, especially for research or quant-focused
roles.
## Areas for Improvement:
1. **Formatting**: The CV could benefit from more consistent formatting. For example,
bullet points or detailed descriptions under each role would make it easier to scan
for specific responsibilities or achievements. Right now, the CV reads more like a
timeline.
2. **Achievements and Impact**: For each role, especially in consultancy and industry,
add concrete accomplishments or measurable impacts, e.g., "Developed and implemented
a risk model that reduced X by Y%." This will help quantify your contributions and give
the reader a sense of your practical results.
3. **Work Experience Details**: The 'Assistant Professor' and 'Independent Consultant'
roles could be expanded. Describe key projects, teaching focus, or consulting
deliverables, especially if they are relevant to future career moves (e.g., towards
quantitative finance roles).
4. **Education Clarity**: The education section is strong, but you could mention any
specific coursework or collaborations, particularly during your PhD or visiting
roles, if they involved notable researchers or institutions.
5. **Tools & Methods Section**: This section could be broken down more clearly. Consider
categorizing into **Languages**, **Methods**, and **Tools** instead of a long list.
Additionally, you might want to include some context for how you've applied these skills
in your work.
6. **Professional Website**: Including your website is good, but ensure that it is
up-to-date and relevant. It should showcase any publications, projects, or further
details about your work.
## Suggestions for Changes:
- **Assistant Professor**: Add a short bullet list of 2-3 key projects, lectures, or
research contributions. For example: "Developed curriculum on decentralized finance,
focusing on market microstructure."
- **Independent Consultant**: Provide more detail on your consulting work, especially if
it involves specific tools, companies, or data science projects that can show industry
relevance.
- **Combine Control Systems**: Describe the type of neural networks used, the impact on
the virtual engine testing process, and any collaborative aspects of the work.
---
In sum, emphasize achievements and measurable results, improve formatting consistency,
and make the experience section more detailed to highlight your impact in each role.
Thus, for any project it is important to evaluate what you need and what level of sophistication you require from the LLM.
However, to run a model like GPT-4 locally, you’d likely need 350GB+ of GPU memory, which could be achieved by using multiple high-end GPUs, such as the NVIDIA A100 (40GB or 80GB versions). For smaller or quantized versions, 40-80GB of GPU memory might suffice, which could be handled by 1-2 GPUs like the NVIDIA A100 or RTX 3090. A full setup for GPT-4’s largest variant could easily exceed $50,000 depending on the number of GPUs needed.
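For open-weight models, one common way to shrink these memory requirements is quantization. A minimal sketch with the transformers library, assuming the bitsandbytes and accelerate packages are installed and using an example open model (swap in any causal LM you have access to); loading the weights in 8-bit roughly halves the footprint compared to fp16:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # example open model, not GPT-4 (whose weights are not public)

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weights to reduce GPU memory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs/CPU
)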