
Showing posts with label Elasticsearch. Show all posts

Difference between "keyword" and "text" types in Elasticsearch

Let's take a very simple statement, "I'm running now!", and see how it is split into tokens and stored in an Elasticsearch index.

Statement: "I'm running now!"

If type="keyword": [I'm running now!] - the whole string is stored as a single token, so you cannot retrieve the document by matching the terms 'running' or 'now'; only the exact value matches.

If type="text": [i'm, running, now] - with the default standard analyzer the string is lowercased, the trailing "!" is dropped, and three tokens are stored, so you can retrieve the document by matching the terms 'running' or 'now'.
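The difference can be mimicked in a few lines of Python. This is only a rough sketch: the regular expression approximates what the default standard analyzer does to this particular sentence (lowercasing and dropping punctuation) and is not the real Elasticsearch tokenizer. The function names are made up for illustration.

```python
import re


def keyword_terms(value):
    """A keyword field indexes the whole value as one term, unchanged."""
    return [value]


def text_terms(value):
    """Rough stand-in for the standard analyzer (an approximation only):
    lowercase, then split on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[a-z0-9']+", value.lower())


statement = "I'm running now!"
print(keyword_terms(statement))  # one term: the full original string
print(text_terms(statement))     # three terms: i'm, running, now

# A search for the term "running" can only hit the text-style terms.
print("running" in keyword_terms(statement))  # False
print("running" in text_terms(statement))     # True
```

To see the real tokens for a given analyzer, the `_analyze` API on your own cluster is the authoritative check.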

Text datatype

A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed. The analysis process allows Elasticsearch to search for individual words within each full text field. Text fields are not used for sorting and seldom used for aggregations (although the significant text aggregation is a notable exception).

If you need to index structured content such as email addresses, hostnames, status codes, or tags, it is likely that you should rather use a keyword field.

Keyword datatype

A field to index structured content such as email addresses, hostnames, status codes, zip codes or tags.

They are typically used for filtering (Find me all blog posts where status is published), for sorting, and for aggregations. Keyword fields are only searchable by their exact value.

If you need to index full text content such as email bodies or product descriptions, it is likely that you should rather use a text field.
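As a hedged illustration of the two datatypes side by side, the sketch below builds request bodies as plain Python dictionaries. The field names `body` and `status` and the overall mapping are assumptions made for this example (typeless mappings as used in Elasticsearch 7 and later); nothing is sent to a cluster.

```python
import json

# Hypothetical mapping: "body" is analyzed full text, "status" is an exact value.
mapping = {
    "mappings": {
        "properties": {
            "body": {"type": "text"},
            "status": {"type": "keyword"},
        }
    }
}

# Full-text search on the text field: the query string is analyzed as well.
match_query = {"query": {"match": {"body": "running"}}}

# Exact-value lookup on the keyword field.
term_query = {"query": {"term": {"status": "published"}}}

# Keyword fields are the natural target for aggregations and sorting.
status_counts = {
    "size": 0,
    "aggs": {"by_status": {"terms": {"field": "status"}}},
}

for name, body in [
    ("mapping", mapping),
    ("match_query", match_query),
    ("term_query", term_query),
    ("status_counts", status_counts),
]:
    print(name)
    print(json.dumps(body, indent=2))
```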

How do I find distinct values of a date or other specific field within a date range in Elasticsearch?

Add the date range as a query to reduce the document set, and then run the terms aggregation only on the documents that fall into that date range:

Terms Aggregation

A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.

Date Range Aggregation

A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned. Note that this aggregation includes the from value and excludes the to value for each range.

POST index/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "tstamp": {
              "gte": 1591795757000,
              "lte": 1591890413000
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "result": {
      "terms": {
        "field": "tstamp",
        "size": 171
      }
    }
  }
}
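For readers who build this request from code, here is a minimal Python sketch of the same idea. The helper names are hypothetical, the response fragment is made up purely to show where the distinct values appear, and `"size": 0` in the body is used in place of the `?size=0` URL parameter.

```python
def distinct_values_request(field, start_ms, end_ms, max_buckets=171):
    """Build a search body: range query first, terms aggregation on the matches."""
    return {
        "size": 0,  # return no hits, only the aggregation result
        "query": {
            "bool": {
                "must": [
                    {"range": {field: {"gte": start_ms, "lte": end_ms}}}
                ]
            }
        },
        "aggs": {"result": {"terms": {"field": field, "size": max_buckets}}},
    }


def distinct_values(response):
    """Each bucket key of the terms aggregation is one distinct value."""
    return [b["key"] for b in response["aggregations"]["result"]["buckets"]]


body = distinct_values_request("tstamp", 1591795757000, 1591890413000)

# Made-up response fragment, for illustration only.
sample_response = {
    "aggregations": {
        "result": {
            "buckets": [
                {"key": 1591795757000, "doc_count": 3},
                {"key": 1591800000000, "doc_count": 1},
            ]
        }
    }
}
print(distinct_values(sample_response))
```

Note that the terms aggregation returns at most `size` buckets, so the limit has to be at least as large as the number of distinct values you expect.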

Issue with running the Elasticsearch Update By Query API in AWS

Welcome to the knowledge gaps that I’ve been screaming about for many moons!

Elasticsearch: how to add a new field to existing documents with update by query. The goal is to back-fill old documents with an email field; the index holds around 1M+ documents.

You can use the update by query API in order to add a new field to all your existing documents:
POST index/_update_by_query?conflicts=proceed&scroll_size=500
{
  "script": {
    "source": "ctx._source.email = 'rohitpatel0105@gmail.com'",
    "lang": "painless"
  },
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "email"
          }
        }
      ]
    }
  }
} 

While the update by query was running, around 700k records had been updated when CPU utilization suddenly hit its maximum and nodes started to go down, leaving the cluster in "Red" status.



The most likely root cause is exactly that: my UBQ (update by query) … which hit too many documents too fast and destroyed the cluster. On top of that, UBQ makes it SO easy to over-match and update docs you didn’t mean to. (Memory and GC graphs are towards the bottom of those dashboards.)
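One way to avoid hitting too many documents too fast is to throttle the request. The sketch below only builds the request path; `requests_per_second` and `wait_for_completion` are documented update by query parameters, but the value of 100 is an arbitrary assumption and the helper name is hypothetical. Tune the rate for your own cluster.

```python
from urllib.parse import urlencode


def update_by_query_path(index, requests_per_second=100, scroll_size=500):
    """Build a throttled, asynchronous _update_by_query path (illustrative values)."""
    params = {
        "conflicts": "proceed",
        "scroll_size": scroll_size,
        "requests_per_second": requests_per_second,  # throttle the update rate
        "wait_for_completion": "false",  # run as a background task
    }
    return f"/{index}/_update_by_query?{urlencode(params)}"


print("POST", update_by_query_path("index"))
```

With `wait_for_completion=false` the response contains a task ID instead of blocking until the update finishes, so the job can be monitored or cancelled later.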

I would suggest that you first use the _tasks API to find and annihilate any remaining running UBQ tasks (which keep running regardless of whether you are still connected) … or just wait out the storm and see if the cluster ever comes back.
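A rough Python sketch of that find-and-cancel step follows. It builds the `_tasks` list and cancel requests and picks update by query task IDs out of a task list response. The sample response is made up for illustration, the helper names are hypothetical, and no HTTP call is made here.

```python
UPDATE_BY_QUERY_ACTION = "indices:data/write/update/byquery"


def list_tasks_request():
    """Request that lists running *byquery tasks across the cluster."""
    return ("GET", "/_tasks?detailed=true&actions=*byquery")


def cancel_task_request(task_id):
    """Request that cancels one task by its ID."""
    return ("POST", f"/_tasks/{task_id}/_cancel")


def running_ubq_task_ids(tasks_response):
    """Collect the IDs of update by query tasks from a _tasks response."""
    ids = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("action") == UPDATE_BY_QUERY_ACTION:
                ids.append(task_id)
    return ids


# Made-up response fragment with hypothetical node and task IDs.
sample = {
    "nodes": {
        "node-1": {
            "tasks": {
                "node-1:101": {"action": "indices:data/write/update/byquery"},
                "node-1:102": {"action": "indices:data/read/search"},
            }
        }
    }
}

print(*list_tasks_request())
for task_id in running_ubq_task_ids(sample):
    print(*cancel_task_request(task_id))
```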

If deleting a problematic index isn't feasible, you can restore a snapshot, delete documents from the index, change the index settings, reduce the number of replicas, or delete other indices to free up disk space. The important step is to resolve the red cluster status before re-configuring your Amazon ES domain. Re-configuring a domain with a red cluster status can compound the problem and lead to the domain being stuck in a configuration state of Processing until you resolve the status.

Conclusion: Don't try to use the update by query API on a large number of docs without closing the index directly in the cluster.

Elasticsearch - How to add new field to existing document by update by query

Elasticsearch: back-fill old documents with an email field.

You can use the update by query API in order to add a new field to all your existing documents:

POST index/_update_by_query
{
  "script": {
    "source": "ctx._source.email = 'rohitpatel0105@gmail.com'",
    "lang": "painless"
  },
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "email"
          }
        }
      ]
    }
  }
}

Are You A Thinker Or Maybe It's Overflowing?

Hi, everybody. I'm Rohit Patel and I'd like to ask you today, "Are you a thinker Or maybe it's overflowing? Do you live in ...
