This paper brings to light how network graph embeddings can help solve major open problems in natural language understanding. One illustrative problem in NLP is language variability - also called the ambiguity problem - which arises when two sentences express the same meaning (or idea) with very different words. For instance, we may say almost interchangeably: “where is the nearest sushi restaurant?” or “can you please give me addresses of sushi places nearby?”. These two sentences share exactly the same meaning with very different wording, and this is the challenge we are struggling with. In data science terms, it is an instance of the well-known text similarity problem: the sparse bag-of-words vectors of the two sentences have almost no words in common and will consequently have a cosine distance close to 1. This is a terrible distance score, because the two sentences have very similar meanings.
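To make the problem concrete, here is a minimal base-R sketch (our own illustration, not part of the original pipeline) of that sparse bag-of-words comparison: the only token the two questions share is “sushi”, so their cosine similarity is close to 0 and their cosine distance close to 1.

# Tokenised versions of the two questions
s1 <- c("where", "is", "the", "nearest", "sushi", "restaurant")
s2 <- c("can", "you", "please", "give", "me", "addresses", "of", "sushi", "places", "nearby")

# Sparse bag-of-words vectors over the joint vocabulary
vocab <- union(s1, s2)
v1 <- as.numeric(vocab %in% s1)
v2 <- as.numeric(vocab %in% s2)

# Cosine similarity of about 0.13, i.e. a cosine distance of about 0.87,
# even though the two questions mean the same thing
cosine_similarity <- sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
1 - cosine_similarity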
The first thing that crosses any data scientist’s mind would be to use popular document embedding methods based on similarity measures - Doc2Vec, averaged word2vec vectors, weighted averages of word2vec vectors (e.g. tf-idf weighting), RNN-based embeddings (e.g. deep LSTM networks), … - to cope with this text similarity challenge.
As for us, we will tackle this text similarity challenge by implementing network graph embeddings alongside traditional word embedding techniques.
We will play with two pieces of information from a single dataset:
We have news written in English on many topics. This is a text format, that is to say a series of paragraphs, sentences and words. This is the micro level. Here we will implement a graph of words; consequently, we will not consider the context of the news, in order to emphasise the words themselves.
In addition, we have information about the context of the news - when they were published, where they come from, which category they fall into, who wrote them, … This is the macro level. Here we will implement a graph of context and will not consider the words.
The objective is to play with these two levels in order to classify news as well as possible.
In very basic terms, word embeddings turn corpus text into numerical vectors, so that two different words sharing the same semantic meaning are close, in terms of Euclidean distance, in a given high-dimensional space. Words that have the same meaning get a similar representation - very close numerical vectors.
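As a toy illustration (the three-dimensional vectors below are invented; real models use hundreds of dimensions), semantically related words end up close together in that space:

# Hypothetical dense embeddings: "sushi" and "restaurant" are related, "election" is not
embeddings <- rbind(
  sushi      = c(0.81, 0.10, 0.45),
  restaurant = c(0.75, 0.05, 0.52),
  election   = c(0.05, 0.92, 0.13)
)

# Pairwise Euclidean distances: sushi-restaurant is about 0.11, sushi-election about 1.16
dist(embeddings)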
This has been shown in the pioneering paper by Abhik Jana and Pawan Goyal, “Can Network Embedding of Distributional Thesaurus be Combined with Word Vectors for Better Representation?”, in which they explain that “Learning word representations is one of the basic and primary steps in understanding text and there are predominantly two views of learning word representations.”
“In one realm of representation, words are vectors of distributions obtained from analyzing their contexts in the text and two words are considered meaningfully similar if the vectors of those words are close in the euclidean space.” These are dense representations of words, such as the predictive model Word2vec (Mikolov et al., 2013) or the count-based model GloVe (Pennington et al., 2014).
The other view “talks about network like structure where two words are considered neighbors if they both occur in the same context above a certain number of times. The words are finally represented using these neighbors. Distributional Thesaurus network is one such instance of this type, the notion of which was used in early work about distributional semantics (Grefenstette, 2012; Lin, 1998; Curran and Moens, 2002).”
The goal of their paper is to turn “a Distributional Thesaurus (DT) network into word embeddings by applying efficient network embedding methods and analyze how these embeddings generated from DT network can improve the representations generated from prediction-based model like Word2vec or dense count based semantic model like GloVe. We experiment with several combination techniques and find that DT network embedding can be combined with Word2vec and GloVe to outperform the performances when used independently.”
Representation learning means using machine learning techniques to derive a representation of the data.
Distributed representation differs from one-hot representation in that it uses dense vectors to represent data points.
Embedding means mapping information entities into a low-dimensional space.
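A tiny sketch of the difference (the dense values are invented for illustration):

# One-hot representation: a vector as long as the vocabulary, with a single 1
vocab <- c("sushi", "pizza", "restaurant", "election", "trump")
one_hot_sushi <- as.numeric(vocab == "sushi")   # 1 0 0 0 0

# Distributed (embedded) representation: a short dense vector of real values
dense_sushi <- c(0.81, 0.10, 0.45)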
We set out to implement some Natural Language Processing (NLP) within a graph database, namely Neo4j. First, a few words about Neo4j, which is a graph database management system developed by Neo4j, Inc. The underlying concept is very simple: everything is stored in the form of either an edge, a node or an attribute. Each node and edge can have any number of attributes, and both nodes and edges can be labelled.
Let’s consider two sentences:
\(S_{1}\) = Where is the nearest sushi restaurant?
\(S_{2}\) = Can you please give me addresses of sushi places nearby?
\(S_{1}\) = {“Where”,“nearest”,“sushi”,“restaurant”}
\(S_{2}\) = {“give”,“addresses”,“sushi”,“places”,“nearby”}
left(“sushi”) = {“where”,“nearest”,“give”,“addresses”}
right(“sushi”) = {“restaurant”,“places”,“nearby”}
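As an illustration, here is a minimal base-R sketch (our own helper, not part of the original pipeline) of extracting the left and right neighbours of a target word; it uses a one-word window and no stop-word removal, so it returns smaller sets than the hand-built ones above.

# Collect the words immediately to the left and right of `target` in a set of sentences
contexts <- function(sentences, target) {
  left <- character(0); right <- character(0)
  for (s in sentences) {
    tokens <- strsplit(tolower(s), "\\W+")[[1]]
    idx <- which(tokens == target)
    left  <- c(left,  tokens[idx[idx > 1] - 1])
    right <- c(right, tokens[idx[idx < length(tokens)] + 1])
  }
  list(left = unique(left), right = unique(right))
}

contexts(c("Where is the nearest sushi restaurant",
           "Can you please give me addresses of sushi places nearby"),
         "sushi")
# $left: "nearest" "of"    $right: "restaurant" "places"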
Here is the Cypher query that loads such a sentence into the neo4j graph database as a chain of words (shown for the reduced \(S_{1}\)):
// split the lower-cased sentence into a list of words
WITH split(toLower("Where nearest sushi restaurant"), " ") AS text
// walk over each pair of consecutive words and link them with a :NEXT relationship
UNWIND range(0, size(text) - 2) AS i
MERGE (w1:Word {name: text[i]})
MERGE (w2:Word {name: text[i + 1]})
MERGE (w1)-[:NEXT]->(w2)
Let’s consider two new sentences:
\(S_{1}\) = My boss eats sushi on Friday
\(S_{2}\) = My brother eats pizza on Sunday evening
Sim("boss","brother") =
Sim(left("boss"), right("boss")) +
Sim(left("brother"), right("brother"))
Let’s consider https://newsapi.org/, a simple and easy-to-use API that returns JSON metadata for headlines and articles live from all over the web right now. News API indexes articles from over 30,000 worldwide sources. We are mostly interested in the text itself - that is to say the headline and description variables.
But let’s keep the other columns to play with the graph and maybe find some interesting things… (a minimal sketch of how such a dataframe can be pulled from the API follows the table below).
author | description | publishedAt | source | title |
---|---|---|---|---|
http://www.abc.net.au/news/lisa-millar/166890 | In the month following Donald Trump’s inauguration it’s clear that Russians are no longer jumping down the aisles. | 2017-02-26T08:08:20Z | abc-news-au | Has Russia changed its tone towards Donald Trump? |
http://www.abc.net.au/news/emily-sakzewski/7680548 | A fasting diet could reverse diabetes and repair the pancreas, US researches discover. | 2017-02-26T04:39:24Z | abc-news-au | Fasting diet ‘could reverse diabetes and regenerate pancreas’ |
http://www.abc.net.au/news/jackson-vernon/7531870 | Researchers discover what could be one of the worst cases of mine pollution in the world in the heart of New South Wales’ pristine heritage-listed Blue Mountains. | 2017-02-26T02:02:28Z | abc-news-au | Mine pollution turning Blue Mountains river into ‘waste disposal’ |
http://www.abc.net.au/news/sophie-mcneill/4516794 | Yemen is now classified as the world’s worst humanitarian disaster but Australia has committed no funding to help save lives there. | 2017-02-26T09:56:12Z | abc-news-au | Australia ignores unfolding humanitarian catastrophe in Yemen |
http://www.abc.net.au/news/dan-conifer/5189074, http://www.abc.net.au/news/6815894 | Malcolm Turnbull and Joko Widodo hold talks in Sydney, reviving cooperation halted after the discovery of insulting posters at a military base, and reaching deals on trade and a new consulate in east Java. | 2017-02-26T03:43:04Z | abc-news-au | Australia and Indonesia agree to fully restore military ties |
Ron Amadeo | If this is how BlackBerry wants to do hardware, we really won’t miss them. | 2017-02-25T21:00:08Z | ars-technica | BlackBerry KeyOne Hands On—BlackBerry wants $549 for mid-range device |
Roheeni Saxena | States that legalized gay marriage early created a natural experiment. | 2017-02-25T20:00:37Z | ars-technica | Same-sex marriage linked to decline in teen suicides |
Roheeni Saxena | We may finally be getting somewhere in our fight against the disease. | 2017-02-25T19:00:16Z | ars-technica | New malaria vaccine is fully effective in very small clinical trial |
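For reference, here is a minimal sketch of how such a dataframe can be pulled into R with httr and jsonlite. This is our own assumption rather than the original collection script: the v2 top-headlines endpoint and the NEWS_API_KEY environment variable are illustrative choices.

library(httr)
library(jsonlite)

# Query News API for the latest headlines of one source
# (API key assumed to be stored in the NEWS_API_KEY environment variable)
resp <- GET("https://newsapi.org/v2/top-headlines",
            query = list(sources = "abc-news-au",
                         apiKey  = Sys.getenv("NEWS_API_KEY")))

articles <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$articles
head(articles[, c("author", "description", "publishedAt", "title")])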
Let’s switch from a flat, static SQL dataframe to a dynamic NoSQL database - i.e. from a relational to a non-relational database.
As you can spot, there are 9 variables in our dataframe coming from newsAPI. The job is to transform each column into a piece of the graph data model. Let’s take an example to get started (a data-reshaping sketch follows the list below).
The “author” variable turns into the red pen logo inside our new NoSQL database
The “description” variable turns into the orange comments logo inside our new NoSQL database
The “publishedAt” variable turns into the green clock logo inside our new NoSQL database
The “source” variable turns into the yellow compass logo inside our new NoSQL database
The “category” variable turns into the black compass logo inside our new NoSQL database
The “title” variable turns into the blue folder logo inside our new NoSQL database
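To make this mapping concrete, here is a hedged sketch (pure R, no Neo4j driver; news_data is the dataframe shown above, while the table and column names are our own choices) of reshaping the flat dataframe into node and relationship tables, the shape a graph database expects:

library(dplyr)

news_with_id <- news_data %>% mutate(news_id = row_number())

# One table per node type
news_nodes   <- news_with_id %>% select(news_id, title, description, publishedAt)
author_nodes <- news_with_id %>% distinct(author)
source_nodes <- news_with_id %>% distinct(source)

# One table per relationship type, e.g. (:News)-[:WRITTEN_BY]->(:Author)
written_by  <- news_with_id %>% select(news_id, author)
from_source <- news_with_id %>% select(news_id, source)

# These tables can then be loaded into Neo4j, e.g. via LOAD CSV or a driver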
It should not be forgotten that the neo4j graph just below is only an isolated implementation for a single piece of news. The time has now come to repeat the command in a loop over every news item. In order to industrialise the building process of our neo4j database, we are going to implement it on https://neo4j.com/
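As a first, very rough sketch of that industrialisation (an assumption on our side rather than the final pipeline - the build_query helper is ours), we can generate one Cypher statement per news description and save them to a file to be run in the Neo4j browser or in cypher-shell:

# Turn one description into the word-chain Cypher statement shown earlier
build_query <- function(text) {
  text <- gsub('"', "", tolower(text))   # drop double quotes so the Cypher stays valid
  sprintf('WITH split("%s", " ") AS text
UNWIND range(0, size(text) - 2) AS i
MERGE (w1:Word {name: text[i]})
MERGE (w2:Word {name: text[i + 1]})
MERGE (w1)-[:NEXT]->(w2);', text)
}

# One statement per news item, written to a .cypher script
queries <- vapply(news_data$description, build_query, character(1))
writeLines(queries, "load_word_graph.cypher")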
This big step will be the main topic of paper 2 [network_work_word_embeddings_part_2.html]
To give a first hint of what neo4j looks like, here is the home screen of my database, in which all the news will be stored…
Just to provide some context on why we undertook this step, let’s consider a concrete example highlighting how important links are for retrieving information quickly and efficiently, and for bringing to light connections that remain invisible in a static, linear database.
So here we just drew some links between different news items when they share features in common, such as the author of the article, where the news comes from (BBC, Eurosport, …), when it was published, which category it falls into, etc. In a word, the first thing to do is to partition/segment news according to the features they jointly share. But this is not the final objective at all. We want a graph of words - what we will later call a distributional thesaurus network. The next step will be to turn it into dense word vectors and investigate the usefulness of the distributional thesaurus embedding in improving the overall word representation.
We have to take stock of what has happened so far:
We have news written in English on many topics, in text format. This is the micro level. Clearly speaking, this is the most valuable piece of information we have, yet so far we have not used it at all. This is the topic under scrutiny.
In addition, we have information about the context of the news - when they were published, where they come from, which category they fall into, who wrote them, … This is the macro level, which we introduced just above.
The objective is to play with these two levels in order to classify news as well as possible.
The news comes from different fields - which ones, and in what proportion?
library(ggplot2)
ggplot(news_data, aes(x = category, fill = category)) + geom_bar() + theme_bw()
In a nutshell, the DT network contains, for each word, a list of words that are similar with respect to their bigram distribution (Riedl and Biemann, 2013). In the network, each word is a node and there is a weighted edge between a pair of words where the weight corresponds to the number of overlapping features.
Each bigram is broken into a word and a feature, where the feature consists of the bigram relation and the related word. Word pairs whose number of overlapping features is above a threshold are retained in the network.
A sample snapshot of a Distributional Thesaurus network, where each node represents a word and the weight of an edge between two words is defined as the number of context features that the two words share.
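Before moving to the real data, here is a toy sketch (our own illustration, with made-up pairs) of that construction: split bigrams into (word, feature) pairs, count how many features every pair of words shares, and keep the pairs above a threshold as weighted edges.

library(dplyr)
library(igraph)

# Toy (word, feature) pairs extracted from bigrams
bigrams <- data.frame(
  word    = c("boss", "brother", "boss",   "brother", "sushi",      "pizza"),
  feature = c("eats", "eats",    "friday", "sunday",  "restaurant", "restaurant")
)

# Count overlapping features for every pair of distinct words
# (dplyr may warn about the many-to-many join; that is expected here)
overlap <- bigrams %>%
  inner_join(bigrams, by = "feature") %>%
  filter(word.x < word.y) %>%
  count(word.x, word.y, name = "weight")

# Keep pairs above a (toy) threshold and build the weighted DT-style network
dt_network <- overlap %>%
  filter(weight >= 1) %>%
  graph_from_data_frame(directed = FALSE)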
library(dplyr)      # data manipulation
library(tidytext)   # unnest_tokens()
library(knitr)      # kable()
library(kableExtra) # kable_styling()

# keep only the descriptions of news in the "general" category
general_news <- news_data %>%
  filter(category == "general") %>%
  select(description)

general_news_bigrams <- general_news %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  as.data.frame()

general_news_bigrams_counts <- general_news_bigrams %>%
  count(bigram, sort = TRUE)

kable(general_news_bigrams_counts[c(1:10), ]) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
bigram | n |
---|---|
of the | 2227 |
in the | 2063 |
on the | 1518 |
to the | 1109 |
the most | 846 |
in a | 788 |
donald trump | 770 |
president donald | 752 |
ap â | 713 |
the internet | 686 |
As we can see, there are a lot of stop words such as “of the”, “is to”, “as a”, … - they obviously bring no information and must therefore be removed from the text.
library(tidyr)   # separate()

# split each bigram into its two words and drop those containing a stop word
bigrams_separated <- general_news_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# recount the remaining bigrams
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
Here we keep only the relevant bigrams and remove those with a weak occurrence: we set a threshold of at least 50 occurrences. In other words, we keep only the bigrams that appear frequently in the text and discard the rare ones.
library(igraph)   # graph_from_data_frame()
library(ggraph)   # network visualisation with the grammar of graphics

# build and plot the word network from the frequent bigrams
bigram_counts %>%
  filter(n >= 50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
Let’s do the same but with sport news
Let’s do the same but with business news
We have collected news from different fields: general, business and sport. It is natural to expect a different vocabulary for each. For instance, in sport news we should find words such as tennis or football, or terms specific to a given sport, whereas in business news we should find words related to trading, finance, marketing, the stock exchange, … The question here is: how different are these sets of news from one another?
If two words lie close to the line - the bisector - in these plots, it means they have roughly the same frequency in the two sets of texts. In other words, words close to the line are used about as often in both categories of news.
library(scales)
library(ggplot2)

frequency <- readRDS("C:/Users/adsieg/Desktop/part_1/frequency.RDS")

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `business news`, color = abs(`business news` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  labs(y = "business news", x = NULL) +
  theme_bw() +
  theme(legend.position = "none")
## Warning: Removed 55043 rows containing missing values (geom_point).
## Warning: Removed 55043 rows containing missing values (geom_text).
Let’s take an example: business news and general news share words such as ‘president’, ‘America’, ‘bank’, ‘Donald’, ‘ceo’, ‘apple’, … Words that are far from the line, on the other hand, are found much more in one set of texts than in the other. For instance, words such as ‘arsenal’, ‘chelsea’ and ‘england’, or ‘donald’, ‘trump’, ‘administration’, ‘jobs’ and ‘data’, are much more characteristic of one category - sport or business - than of the other.
Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between business and the general news, and between sport and business news?
cor.test(data = frequency[frequency$author == "General news",],
~ proportion + `business news`)
##
## Pearson's product-moment correlation
##
## data: proportion and business news
## t = 161.79, df = 10456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8397572 0.8507015
## sample estimates:
## cor
## 0.845318
cor.test(data = frequency[frequency$author == "sport news",],
~ proportion + `business news`)
##
## Pearson's product-moment correlation
##
## data: proportion and business news
## t = 14.547, df = 5625, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1651054 0.2154703
## sample estimates:
## cor
## 0.1904131
From one dataset we can set up two graph databases:
The first one relates to the news’ context (where they come from, when they were published, which category they fall into, …). The context already provides a first classification by drawing links between news items.
The second one relates to the news text itself. Based on the text, we can draw links between words and obtain a classification of news that reflects their meaning.
The big challenge of this paper is then to introduce how we are going to “analyze the effect of integrating the knowledge of Distributional Thesaurus network with the state-of-the-art word representation models to prepare a better word representation.” This underlying goal will be the topic of next tutorial.
We first prepare vector representations from the Distributional Thesaurus (DT) network by applying network embedding methods.
Next we combine this thesaurus embedding with state-of-the-art vector representations prepared using the GloVe and Word2vec models for analysis (a toy sketch of one simple combination strategy follows below).
The combined representation of GloVe and DT embeddings shows a promising performance gain over the state-of-the-art embeddings used independently.
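As a teaser, here is a toy sketch of the simplest possible combination strategy - L2-normalise each embedding matrix and concatenate the vectors of the shared vocabulary. This is our own minimal illustration, not the exact method of the paper; glove_vectors and dt_vectors are assumed to be word-by-dimension matrices with words as row names.

# Concatenate two embedding spaces after L2-normalising each of them
combine_embeddings <- function(emb_a, emb_b) {
  common <- intersect(rownames(emb_a), rownames(emb_b))
  l2 <- function(m) m / sqrt(rowSums(m^2))
  cbind(l2(emb_a[common, , drop = FALSE]),
        l2(emb_b[common, , drop = FALSE]))
}

# combined <- combine_embeddings(glove_vectors, dt_vectors)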
If you are interested in what we introduced, feel free to share and go to the next page [network_work_word_embeddings_part_2.html] :)