Tired of Manual Screening for Your Scoping Review? Here's How I Built My Digital Assistant


I hit a wall last month. After spending hours crafting the perfect search queries for PubMed and Web of Science, I exported everything to Zotero only to realize I had a new problem: figuring out which of the more than 600 academic papers actually mattered for my coffee drying research.

I started the usual way - open abstract, read, decide, repeat. After screening about 20 papers, a pattern emerged. I was seeing lots of papers about coffee roasting models (interesting, but the wrong process), coffee ground drying (post-processing, not what I needed), and even mathematical models of coffee stain patterns and coffee droplets (mathematically interesting, but not the kind of thing to include in my review).

Those first 20 papers were actually valuable - not for their content, but for showing me exactly what I needed to exclude. Coffee roasting had different keywords and temperature ranges. Coffee ground studies focused on spent coffee waste, and some focused on mathematical models for generating biogas from coffee waste. I was looking only at fresh coffee beans, specifically their drying process.

By the tenth paper about coffee plant diseases (not what I needed) and yet another about coffee shop storage models (definitely not what I needed), I was losing my mind. All I wanted were papers about drying coffee beans that included some kind of mathematical modeling.

Late one night, staring at my screen full of unscreened papers, I had an idea. Why not write some code to help me spot the irrelevant ones?

It started simple enough - just a Python script to look for keywords. But then I kept thinking "what if..." What if I could use natural language processing to better understand the abstracts? What if I could automatically detect papers that talked about mathematical models? Four cups of coffee later (the irony wasn't lost on me), I had something working.

The code wasn't fancy at first. Just spaCy for processing the text, some regex patterns to catch mathematical terms, and scikit-learn to help group similar papers together. I focused on three things:

  • Is this about drying?

  • Is this about coffee beans specifically?

  • Does it mention any kind of mathematical model or analysis?
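
The "group similar papers together" part I mentioned relied on scikit-learn's standard tools: turn each abstract into a TF-IDF vector, then cluster the vectors so papers with similar wording land in the same group. Here is a rough, self-contained sketch of that idea - the toy abstracts, cluster count, and variable names are purely illustrative, not my actual data or settings:

python code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A few toy abstracts for illustration; in practice this is the list exported from Zotero
abstracts = [
    "A thin-layer drying model for green coffee beans under hot air.",
    "Regression analysis of moisture loss during coffee bean drying.",
    "Effect of roasting temperature profiles on coffee aroma compounds.",
    "Mathematical modelling of coffee stain pattern formation in drying droplets.",
]

# Turn each abstract into a TF-IDF vector so similar wording ends up close together
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

# Cluster the vectors; the number of clusters is something you tune by inspection
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
print(labels)  # papers sharing a cluster label are likely about the same theme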

Of course, things went wrong. My first run flagged everything with the word "dry" - including papers about dried coffee grounds in cosmetics and that persistent coffee stain paper that kept showing up. Back to the code I went. Each time something slipped through, I refined the patterns. "Moisture reduction" needed to be caught just like "drying." "Green coffee" meant the same as "coffee beans."
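
In code terms, the fix was mostly about treating synonyms as equivalent and matching whole phrases, not just single words. Something along these lines - the phrase lists and the mentions_any helper below are an illustrative reconstruction, not my exact script:

python code:
# Synonyms added after papers kept slipping through (reconstructed examples)
drying_phrases = ["drying", "dehydration", "moisture reduction"]
coffee_phrases = ["coffee beans", "green coffee"]

def mentions_any(abstract, phrases):
    # Simple substring check on the lowercased abstract, so multi-word phrases also match
    text = abstract.lower()
    return any(phrase in text for phrase in phrases)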

The best part? Once I got it working, screening became less of a chore. Each paper got a simple yes/no, a note explaining why, and for the good ones, details about their mathematical approach. My Zotero library started making sense again.
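
Concretely, each screening result boiled down to a small record per paper. The field names and sample values below are just one way to picture it, not my exact output format:

python code:
# Illustrative shape of one screening result (names and values are made up)
result = {
    "title": "Thin-layer drying model of green coffee beans",  # hypothetical paper
    "include": True,                                            # the simple yes/no
    "reason": "mentions drying, coffee beans and a model",      # the note explaining why
    "math_approach": "thin-layer model fitted by regression",   # only filled for included papers
}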

I'm still tweaking things. Sometimes the code misses papers that use unusual terminology, and occasionally it gets too excited about anything with numbers. But compared to where I started - drowning in irrelevant papers - it's a huge improvement.

For anyone curious, the code I ended up with is on my GitHub. It's not perfect, but it saved my sanity during this review. Maybe it can help someone else too.

Make This Assistant Your Own

If you've made it this far, you might be thinking "This sounds great, but my research isn't about coffee." Here's the beauty of it - you can adapt this screening assistant for any topic. Let me show you how the code works and what you need to change.

Understanding the NLP Magic

The natural language processing part of the code does three main things:

  1. Text Processing: Using spaCy, it breaks down each abstract into tokens (individual words) while removing common words like "the" or "and" that don't add meaning. This helps focus on the important terms.

  2. Relevance Checking: The code looks for three types of keywords:

python code:
drying_keywords = ["drying", "dehydration", "moisture"]
coffee_keywords = ["coffee", "beans", "coffee beans"]
math_keywords = ["model", "statistical", "regression", "mathematical", "analysis", "simulation"]

  3. Pattern Matching: The code uses regular expressions to catch more complex phrases:

python code:
math_patterns = r"(regression|mathematical model|statistical model|simulation|equation|analysis)"

You can modify this to catch specific methodologies in your field.
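
To make the pattern-matching step concrete, here is roughly how such a regex gets applied to an abstract. The re.search call is standard Python; the sample sentence is just an illustration:

python code:
import re

math_patterns = r"(regression|mathematical model|statistical model|simulation|equation|analysis)"

# Toy example: re.search returns a match object if any of the phrases appears
sample = "We propose a mathematical model and regression analysis of drying kinetics."
match = re.search(math_patterns, sample.lower())
print(match is not None, match.group(0) if match else None)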

Customizing the Assistant

To adapt this for your research:

  1. Define Your Core Concepts: What are the main themes you're looking for? In my case, it was drying processes, coffee beans, and mathematical modeling. For you, it might be different treatments, materials, or methods.

  2. List Your Keywords: For each core concept, create a list of related terms. Include synonyms and alternative phrasings. Remember how I had to add "moisture reduction" because not everyone says "drying"? Think about similar variations in your field.

  3. Update the Analysis Function: The analyze_abstract() function combines these elements to make decisions. You might need to adjust the logic - maybe you want papers that mention either of two methods, rather than requiring both.

python code:

import spacy
nlp = spacy.load("en_core_web_sm")  # needs: python -m spacy download en_core_web_sm

# Generic names for the keyword lists defined earlier
process_keywords, material_keywords, method_keywords = drying_keywords, coffee_keywords, math_keywords

def analyze_abstract(abstract):
    # Tokenize the lowercased abstract, dropping stop words and non-alphabetic tokens
    doc = nlp(abstract.lower())
    tokens = [token.text for token in doc if not token.is_stop and token.is_alpha]

    # Change these conditions based on your needs
    is_relevant_process = any(word in tokens for word in process_keywords)
    is_relevant_material = any(word in tokens for word in material_keywords)
    is_relevant_method = any(word in tokens for word in method_keywords)

    # Keep the paper only if all three concepts show up
    return is_relevant_process and is_relevant_material and is_relevant_method
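
For example, if your question should keep papers that use either statistical or simulation methods, instead of requiring everything from one combined list, the decision step could be adjusted along these lines. The keyword lists and the analyze_abstract_either name are illustrative placeholders, not part of my script:

python code:
# Illustrative keyword lists for two alternative method families (placeholders)
statistical_keywords = ["regression", "anova", "statistical"]
simulation_keywords = ["simulation", "cfd", "numerical"]

def analyze_abstract_either(abstract):
    doc = nlp(abstract.lower())
    tokens = [token.text for token in doc if not token.is_stop and token.is_alpha]

    is_relevant_process = any(word in tokens for word in process_keywords)
    is_relevant_material = any(word in tokens for word in material_keywords)

    # Accept papers that use EITHER method family
    uses_statistics = any(word in tokens for word in statistical_keywords)
    uses_simulation = any(word in tokens for word in simulation_keywords)

    return is_relevant_process and is_relevant_material and (uses_statistics or uses_simulation)
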
One Last Tip: Start with a small test set of papers you've already screened manually. Run your modified assistant on these and see how it performs. This will help you catch any terms you might have missed or remove ones that are too broad. Remember, the goal isn't to replace your judgment - it's to help you focus it on the papers that most likely matter for your research.
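
If it helps, that sanity check can be as simple as comparing the assistant's decisions against your manual ones. The manually_screened list and its labels below are made up for illustration:

python code:
# Papers you've already screened by hand: (abstract, your manual include/exclude decision)
manually_screened = [
    ("A mathematical model of thin-layer drying of green coffee beans.", True),
    ("Spent coffee grounds as a substrate for biogas production.", False),
]

agreements = 0
for abstract, manual_decision in manually_screened:
    assistant_decision = analyze_abstract(abstract)  # the function defined above
    if assistant_decision == manual_decision:
        agreements += 1
    else:
        print("Disagreement:", abstract[:60], "| manual:", manual_decision, "| assistant:", assistant_decision)

print(f"Agreement: {agreements}/{len(manually_screened)} papers")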

Feel free to grab the code from my GitHub and modify and run it in your own Jupyter notebook.
